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Concrete resource analysis of the quantum linear system algorithm 
used to compute the electromagnetic scattering cross section of a 2D target 
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We provide a detailed estimate for the logical resource requirements of the quantum linear system 
algorithm [Harrow et al., Phys. Rev. Lett. 103, 150502 (2009)] including the recently described 
elaborations and application to computing the electromagnetic scattering cross section of a metallic 
target [Clader et al., Phys. Rev. Lett. 110, 250504 (2013)]. Our resource estimates are based on the 
standard quantum-circuit model of quantum computation; they comprise circuit width (related to 
parallelism), circuit depth (total number of steps), the number of qubits and ancilla qubits employed, 
and the overall number of elementary quantum gate operations as well as more specific gate counts 
for each elementary fault-tolerant gate from the standard set {A, T, Z, TL, S, T, CNOT}. In order 
to perform these estimates, we used an approach that combines manual analysis with automated 
estimates generated via the Quipper quantum programming language and compiler. Our estimates 
pertain to the explicit example problem size N = 332, 020, 680 beyond which, according to a crude 
big-O complexity comparison, the quantum linear system algorithm is expected to run faster than 
the best known classical linear-system solving algorithm. For this problem size, a desired calculation 
accuracy e = 0.01 requires an approximate circuit width 340 and circuit depth of order 10^® if oracle 
costs are excluded, and a circuit width and circuit depth of order 10* and 10^®, respectively, if the 
resource requirements of oracles are included, indicating that the commonly ignored oracle resources 
are considerable. In addition to providing detailed logical resource estimates, it is also the purpose 
of this paper to demonstrate explicitly (using a fine-grained approach rather than relying on coarse 
big-0 asymptotic approximations) how these impressively large numbers arise with an actual circuit 
implementation of a quantum algorithm. While our estimates may prove to be conservative as more 
efficient advanced quantum-computation techniques are developed, they nevertheless provide a valid 
baseline for research targeting a reduction of the algorithmic-level resource requirements, imply¬ 
ing that a reduction by many orders of magnitude is necessary for the algorithm to become practical. 
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I. INTRODUCTION 

Quantum computing promises to efficiently solve cer¬ 
tain hard computational problems for which it is believed 
no efficient classical algorithms exist [I| . Designing quan¬ 
tum algorithms with a computational complexity supe¬ 
rior to that of their best known classical counterparts is 
an active research field . The Quantum Linear System 
Algorithm (QLSA), first proposed W Harrow et. al. Q, 
afterwards improved by Ambainis and recently gen¬ 
eralized by Clader et. al. is appealing because of its 
great practical relevance to modern science and engineer¬ 
ing. This quantum algorithm solves a large system of 
linear equations under certain conditions exponentially 
faster than any current classical method. 

The basic idea of QLSA, essentially a matrix-inversion 
quantum algorithm, is to convert a system of linear equa¬ 
tions, Ax = b, where A is a Hermitiar0 N x N matrix 
over the field of complex numbers C and x, b G C^, 
into an analogous quantum-theoretic version, A\x) = \b), 


* Corresponding author: arturschererl7@gmail.com 
^ Note that, if A is not Hermitian, the problem can be restated as 
Ax = b with a Hermitian matrix A := ( j’t p ), see Sec. Hill 


where \x ), \b) are vectors in a Hilbert space TL = (C^)®" 
corresponding to n = [log 2 N~\ qubits and A is a self- 
adjoint operator on 'H, and use various quantum compu¬ 
tation techniques [H, H, 0,0 to solve for |a;). 

Extended modifications of QLSA have also been ap¬ 
plied to other important problems (cf. 0), such as least- 
squares curve-fitting Q, solving linear differential equa¬ 
tions 0, and machine learning m- Recent efforts in 
demonstratin g s mall-scale experimental implementation 
of QLSA [l^, [l^ have further highlighted its popularity. 

A. Objective of this work 

The main objective of this paper is to provide a de¬ 
tailed logical resource estimate (LRE) analysis of QLSA 
based on its further elaborated formulation Q. Our anal¬ 
ysis particularly also aims at including the commonly ig¬ 
nored resource requirements of oracle implementations. 
In addition to providing a detailed LRE for a large prac¬ 
tical problem size, another important purpose of this 
work is to demonstrate explicitly, i.e., using a fine-grained 
approach rather than relying on big-0 asymptotic ap¬ 
proximations, how the concrete resource counts accumu¬ 
late with an actual quantum-circuit implementation of a 
quantum algorithm. 
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Our LRE is based on an approach which combines 
manual analysis with automated estimates generated via 
the programming language Quipper and its compiler. 
Quipper [alii is a domain specific, higher-order, func¬ 
tional language for quantum computation, embedded in 
the host-language Haskell. It allows automated quan¬ 
tum circuit generation and manipulation; equipped with 
a gate-count operation, Quipper offers a universal auto¬ 
mated LRE tool. We demonstrate how Quipper’s power¬ 
ful capabilies have been exploited for the purpose of this 
work. 

We underline that our research contribution is not 
merely providing the LRE results, but also to demon¬ 
strate an approach to how a concrete resource estimation 
can be done for a quantum algorithm used to solve a 
practical problem of a large size. Finally, we would also 
like to emphasize the modular nature of our approach, 
which allows to incorporate future work as well as to as¬ 
sess the impact of prospective advancements of quantum- 
computation techniques. 


B. Context and setting of this work 

Our analysis was performed within the scope of a larger 
context: lARPA Quantum Computer Science (QCS) pro¬ 
gram [l^, whose goals were to achieve an accurate esti¬ 
mation and moreover a significant reduction of the nec¬ 
essary computational resources required to implement 
quantum algorithms for practically relevant problem sizes 
on a realistic quantum computer. The work presented 
here was conducted as part of our general approach 
to tackle the challenges of lARPA QCS program: the 
PLATO projedH, which stands for ‘Protocols, Languages 
and Tools for Resource-efficient Quantum Computation’. 

The QCS program BAA presented a list of seven 
algorithms to be analyzed. For the purpose of evaluation 
of the work, the algorithms were specified in ‘government- 
furnished information’ (GFI) using pseudo-code to de¬ 
scribe purely-quantum subroutines and explicit oracles 
supplemented by Python or Matlab code to compute pa¬ 
rameters or oracle values. While this lARPA QCS pro¬ 
gram GFI is not available as published material, the 
Quipper code developed as part of the PLATO project 
to implement the algorithms and used for our LRE anal¬ 
yses is available as published library code [3,111. In 
our analyses, we found the studied algorithms to cover a 
wide range of different quantum computation techniques. 
Additionally, with the algorithm parameters supplied for 


^ The aspect of PLATO most closely aligned with the topic of 
this paper was the understanding of the resources required to 
run a quantum algorithm followed by research into the reduction 
of those resources. 

® The GFI for QLSA was provided by B. D. Clader and B. C. Ja¬ 
cobs, the coauthors of the work Q whose supplementary material 
includes a considerable part of that GFI. 


our analyses, we have seen a wide range of complexities 
as measured by the total number of gate operations re¬ 
quired, including some that could not be executed within 
the expected life of the universe under current predic¬ 
tions of what a practical quantum computer would be 
like when it is developed. 

This approach is consistent with the one commonly 
used in computer science for algorithms analysis. There 
are at least two reasons for looking at large problem sizes. 
First, in classical computing, we have often been wrong 
in trying to predict how computing resources will scale 
across periods of decades. We can expect to make more 
accurate predictions in some areas in quantum comput¬ 
ing because we are dealing with basic physical proper¬ 
ties that are relatively well studied. However, disruptive 
changes may still occuij^. Thus, in computer science, one 
likes to understand the effect of scale even when it goes 
beyond what is currently considered practical. The sec¬ 
ond reason for considering very large problem sizes, even 
those beyond a practical scale, is to develop the level of 
abstraction necessary to cope with them. The resulting 
techniques are not tied to a particular size or problem 
and can then be adapted to a wide range of algorithms 
and sizes. In practice, some of our original tools and tech¬ 
niques were developed while expecting smaller algorithm 
sizes. Developing techniques for enabling us to cope with 
large algorithm sizes resulted in speeding up the analysis 
for small algorithm sizes. 

Our focus in this paper is the logical part of the quan¬ 
tum algorithm implementation. More precisely, here we 
examine only the algorithmic-level logical resources of 
QLSA and do not account for all the physical overhead 
costs associated with techniques to enable a fault-tolerant 
implementation of this algorithm on a realistic quan¬ 
tum computer under real-world conditions. Such tech¬ 
niques include particularly quantum control (QG) proto¬ 
cols and quantum error correction (QEC) and/or mitiga¬ 
tion codes. Nor do we take into account quantum com¬ 
munication costs required to establish interactions be¬ 
tween two distant qubits so as to implement a two-qubit 
gate between them. These additional physical resources 
will depend on the actual physical realization of a quan¬ 
tum computer (ion-traps, neutral atoms, quantum dots, 
superconducting qubits, photonics, etc.), and also include 
various other costs, such as those due to physical qubit 
movements in a given quantum computer architecture, 
their storage in quantum memories, etc. The resource 
estimates provided here are for the abstract logical quan¬ 
tum circuit of the algorithm, assuming no errors due to 
real-world imperfections, no QG or QEC protocols, and 


At the time of ENIAG and other early classical computers, it 
seems unlikely that considering how the size of the computer 
could be reduced and its power increased would make us consider 
the invention of the transistor. Instead, we would have consid¬ 
ered how vacuum tubes could be designed smaller or could be 
made so as to perform more complex operations. 
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no connectivity constraints for a particular physical im¬ 
plementation. 

Determining the algorithmic-level resources is a very 
important and indispensable first step towards a com¬ 
plete analysis of the overall resource requirements of each 
particular real-world quantum-computer implementation 
of an algorithm, for the following reasons. First, it helps 
to understand the structural features of the algorithm, 
and to identify the actual bottlenecks of its quantum- 
circuit implementation. Second, it helps to differenti¬ 
ate between the resource costs that are associated with 
the algorithmic logical-level implementation (which are 
estimated here) and the additional overhead costs as¬ 
sociated with physically implementing the computation 
in a fault-tolerant fashion including quantum-computer- 
technology specific resources. Indeed, the algorithmic- 
level LRE constitutes a lower bound on the minimum re¬ 
source requirements that is independent of which QEC or 
QC strategies are employed to establish fault-tolerance, 
and independent of the physics details of the quantum- 
computer technology. For this reason, it is crucial to de¬ 
velop techniques and tools for resource-efficient quantum 
computation even at the logical quantum-circuit level of 
the algorithm implementation. The LRE for QLSA pro¬ 
vided in this paper will serve as a baseline for research 
into the reduction of the algorithmic-level minimum re¬ 
source requirements. 

Finally we emphasize that our LRE analysis only ad¬ 
dresses the resource requirements for a single run of 
QLSA, which means that it does not account for the fact 
that the algorithm needs to be run many times and fol¬ 
lowed by sampling in order to achieve an accurate and 
reliable result with high probability. 

C. Review of previous work 

The key ideas underlying QLSA iiS can be briefly 
summarized as follows; for a detailed description, see 
Sec. [ml The preliminary step consists of converting the 
given system of linear equations Ax = b (with x, b G 
and A a Hermitian N x N matrix with A^ G C) into the 
corresponding quantum-theoretic version A |a:) = |6) over 
a Hilbert space TL = (C^)®” of n = |’log 2 fV] qubits. It 
is important to formulate the original problem such that 
the operator A : "H —>• "H is self-adjoint, see footnote [TJ 

Provided that oracles exist to efficiently compute A 
and prepare state \b), the main task of QLSA is to 
solve for \x). According to the spectral theorem for self- 
adjoint operators, the solution can be formally expressed 
as |a;) = A~^ |6) = Y^^=iPj /^3 where \j and \uj) 
are the eigenvalues and eigenvectors of A, respectively, 
and 15) = 1% ) i® expansion of quantum state 

\b) in terms of these eigenvectors. QLSA is designed to 
implement this representation. 

The algorithm starts with preparing (in a multi¬ 
qubit data register) the known quantum state \b) us¬ 
ing an oracle for vector b. Next, Hamiltonian evolu¬ 


tion exp(—iAr/r) with A as the Hamilton operator is 
applied to |6). This is accomplished by using an ora¬ 
cle for matrix A and Hamiltonian Simulation (HS) tech¬ 
niques Q. The Hamiltonian evolution is part of the 
well-established technique known as quantum phase es¬ 
timation algorithm (QPEA) 0,0, here employed as a 
sub-algorithm of QLSA to acquire information about the 
eigenvalues Xj of A and store them in QPEA’s control 
register. In the next step, a single-qubit ancilla start¬ 
ing in state |0) is rotated by an angle inversely pro¬ 
portional to the eigenvalues of A stored in QPEA’s 
control register. Finally, the latter are uncomputed by 
the inverse QPEA yielding a quantum state of the form 

Ef=i Po v^i- c-VAl \u,) ® |0) + Ef=i CP,/\, \u,) ® |I), 

with the solution |a:) correlated with the value I in the 
auxiliary single-qubit register. Thus, if the latter is mea¬ 
sured and the value 1 is found, we know with certainty 
that the desired solution of the problem is stored in the 
quantum amplitudes of the multi-qubit quantum regis¬ 
ter in which \b) was initially prepared. The solution can 
then either be revealed by an ensemble measurement (a 
statistical process requiring the whole procedure to be 
run many times), or useful information can also be ob¬ 
tained by computing its overlap |(i? |a:)|^ with a particu¬ 
lar (known) state |i?) (corresponding to a specific vector 
R G C^lthat has been prepared in a separate quantum 
register Q. 

Harrow, Hassidim and Lloyd (HHL) Q showed that, 
given the matrix A is well-conditioned and sparse or can 
efficiently be decomposed into a sum of sparse matrices, 
and if the elements of matrix A and vector b can be ef¬ 
ficiently computed, then QLSA provides an exponential 
speedup over the best known classical linear-system solv¬ 
ing algorithm. The performance of any matrix inversion 
algorithm depends crucially on the condition number k 
of the matrix A, i.e., the ratio between A’s largest and 
smallest eigenvalues. A large condition number means 
that A becomes closer to a matrix which cannot be in¬ 
verted, referred to as ‘ill-conditioned’; the lower the value 
of K the more ‘well-conditioned’ is A. Note that k is a 
property of the matrix A and not of the linear-system- 
solving algorithm. Roughly speaking, k characterizes 
the stability of the solution x with respect to changes 
in the given vector b. Further important parameters to 
be taken into account are the sparseness d (i.e., the max¬ 
imum number of non-zero entries per row/column in the 
matrix A), the size N of the square matrix A, and the 
desired precision of the calculation represented by error 
bound e. 

In 0 it was shown that the number of operations re¬ 
quired for QLSA scales as 

O {t^d^ log(iV)/e) , (1) 

while the best known classical linear-system sol ving al¬ 
gorithm based on conjugate gradient method has 

the run-time complexity 

0{NdK\og{l/e)) , 


( 2 ) 
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where, compared to 0{-), the O(-) notation suppresses 
more slowly-growing terms. Thus, it was concluded in Q 
that, in order to achieve an exponential speedup of QLSA 
over classical algorithms, k must scale, in the worst case, 
as poly log(A^) with the size of the N x N matrix A. 

The original HHL-QLSA Q has the drawback to be 
non-deterministic, because accessing information about 
the solution is conditional on recording outcome 1 of a 
measurement on an auxiliary single-qubit, thus in the 
worst case requiring many iterations until a successful 
measurement event is observed. To substantially increase 
the success probability for this measurement event indi¬ 
cating that the inversion A~^ has been successfully per¬ 
formed and thus the solution |a;) (up to normalization) 
has been successfully computed (i.e., probability that the 
post-selection succeeds)^ HHL-QLSA includes a procedure 
based on quantum amplitude amplification (QAA) [^ . 
However, in order to determine the normalization factor 
of the actual solution vector |a;), the success probabil¬ 
ity of obtaining 1 must be ‘measured’, requiring many 
runs to acquire sufficient statistics. In addition, because 
access to the entire solution \x) is impractical as it is a 
vector in an exponentially large space, HHL suggested 
that the information about the solution can be extracted 
by calculating the expectation value (a;| M |a;) of an ar¬ 
bitrary quantum-mechanical operator M, corresponding 
to a quadratic form x^Mx with some M G ([^nxn j.gp_ 
resenting the feature of x that one wishes to evaluate. 
But such a solution readout is generally also a nontrivial 
task and typically would require the whole algorithm to 
be repeated numerous times. 

In a subsequent work, Ambainis proposed using 
variable-time quantum amplitude amplification to im¬ 
prove the run-time of HHL algorithm from O(K^logA) 
to 0 (k log^ K log A^), thus achieving an almost optimal 
dependence on the condition number However, the 
improvement of the dependence of the run-time on k was 
thereby attained at the cost of substantially worsening 
its scaling in the error bound e. 

The recent QLSA analysis by Clader, Jacobs and 
Sprouse (CJS) @ incorporates useful elaborations to 
make the original algorithm more practical. In partic¬ 
ular, a general method is provided for efficient prepa¬ 
ration of the generic quantum state \b) (as well as of 
\R)). Moreover, CJS proposed a deterministic version of 
the algorithm by removing the post-selection step and 
demonstrating a resolution to the read-out problem dis¬ 
cussed above. This was achieved by introducing several 
additional single-qubit ancillae and using the quantum 
amplitude estimation (QAE) technique |22l | to determin- 


® In Q it was also shown that the run-time cannot be made 
polylogffv), unless BQP=PSPACE, which, while not yet dis- 
proven, is highly unlikely to be true in computational complexity 
theory. Hence, because polylogfre) = o(k^) for all e > 0, QLSA’s 
run-time is asymptotically also bounded from below as given by 
complexity 


istically estimate the values of the success probabilities 
of certain ancillae measurement events in terms of which 
the overlap |(i? |a:)|^ of the solution |a;) with any generic 
state \R) can be expressed after performing a controlled 
swap operation between the registers storing these vec¬ 
tors. Finally, CJS also addressed the condition-number 
scaling problem and showed how by incorporating matrix 
preconditioning into QLSA, the class of problems that can 
be solved with exponential speedup can be expanded to 
worse than n ^ poly log(A^)-conditioned matrices. With 
these generalizations aiming at improving the efficiency 
and practicality of the algorithrm CJS-QLSA was shown 
to have the run-time complexitjjj 


O (^KdJ \og{N)/e^) 


(3) 


which is quadratically better in k than in the original 
HHL-QLSA. To demonstrate their method, CJS applied 
QLSA to computing the electromagnetic scattering cross 
section of an arbitrary object, using the finite-element 
method (FEM) to transform Maxwell’s equations into a 
sparse linear system 


D. What makes our approach differ 
from previous work? 

In the previous analyses of QLSA iii , resource 
estimation was performed using ‘big-0’ complexity anal¬ 
ysis, which means that it only addressed the asymptotic 
behavior of the run-time of QLSA, with reference to a 
similar big-0 characterization for the best known clas¬ 
sical linear-system solving algorithm. Big-0 complexity 
analysis is a fundamental technique that is widely used 
in computer science to classify algorithms; indeed, it rep¬ 
resents the core characterization of the most significant 
features of an algorithm, both in classical and quantum 
computing. This technique is critical to understanding 
how the use of resources and time grows as the inputs to 
an algorithm grow. It is particularly useful for compar¬ 
ing algorithms in a way where details, such as start-up 
costs, do not eclipse the costs that become important for 


® But note that, while the CJS run-time complexity [Eq.llSll] scales 
quadratically better in the condition number k than the original 
HHL complexity [Eq.(^], the former scales quadratically worse 
than the latter with respect to the parameters d and e. However, 
the two run-time complexities should not be directly compared, 
because the corresponding QLS algorithms achieve somewhat dif¬ 
ferent tasks. Besides, it is our opinion that the linear scaling of 
CJS run-time complexity in k is based on an overoptimistic as¬ 
sumption in its derivation. Indeed, while CJS removed the QAA 
step from the HHL algorithm, they replaced it with the nearly 
equivalent QAE step, which we believe has a similar resource 
requirement as the former, and thus may require up to 0{K/e) 
iterations to ensure successful amplitude estimation within mul¬ 
tiplicative accuracy e, in addition to the factor 0{K,/e) resulting 
from the totally independent QPEA step. See also our remark 
in footnote II II 
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the larger problems where resource usage typically mat¬ 
ters. However, this analysis assumes that those constant 
costs are dwarfed by the asymptotic costs for problems of 
interest as has typically proven true for practical classi¬ 
cal algorithms. In QCS, we set out to additionally learn 
(1) whether this assumption holds true for quantum al¬ 
gorithms, and (2) what the actual resource requirements 
would be as part of starting to understand what would 
be required for a quantum computer to be a practical 
quantum computer. 

In spite of its key relevance for analyzing algorithmic 
efficiency, a big-0 analysis is not designed to provide a 
detailed accounting of the resources required for any spe¬ 
cific problem size. That is not its purpose, rather it is 
focused on determining the asymptotic leading order be¬ 
havior of a function, and does not account for the con¬ 
stant factors multiplying the various terms in the func¬ 
tion. In contrast, in our case we are interested, for a 
specific problem input size, in detailed information on 
such aspects as the number of qubits required, the size of 
the quantum circuit, and run time required for the algo¬ 
rithm. These aspects, in turn, are critical to evaluating 
the practicality of actually implementing the algorithm 
on a quantum computer. 

Thus, in this work we report a detailed analysis of the 
number of qubits required, the quantity of each type of 
elementary quantum logic gate, the width and depth of 
the quantum circuit, and the number of logical time steps 
needed to run the algorithm - all for a realistic set of pa¬ 
rameters K, d, N and e. Such a fine-grained approach to a 
concrete resource estimation may help to identify the ac¬ 
tual bottlenecks in the computation, which algorithm op¬ 
timizations should particularly focus on. Note that this 
is similar to the practice in classical computing, where 
we would typically use techniques like run-time profiling 
to determine algorithmic bottlenecks for the purpose of 
program optimization. It goes without much saying that 
the big-0 analyses in Sii and the more fine-grained 
LRE analysis approach presented here are both valuable 
and complement each other. 

Two more differences are worth mentioning. Unlike in 
previous analyses of QLSA, our LRE analysis particularly 
also includes resource requirements of oracle implementa¬ 
tions. Finally, this work leverages the use of novel univer¬ 
sal automated circuit-generation and resource-counting 
tools (e.g. Quipper) that are currently being developed 
for resource-efficient implementations of quantum com¬ 
putation. As such our work advances efforts and tech¬ 
niques towards practical implementations of QLSA and 
other quantum algorithms. 


E. Main Results of this work 

We find that surprisingly large logical gate counts and 
circuit depth would be required for QLSA to exceed 
the performance of a classical linear-system solving algo¬ 
rithm. Our estimates pertain to the specific problem size 


N = 332,020,680. This explicit example problem size 
has been chosen such that QLSA and the best known 
classical linear-system solving method are expected to 
require roughly the same number of operations to solve 
the problem, assuming equal algorithmic precisions. This 
is obtained by comparing the corresponding big-0 esti¬ 
mates, Eq. (|31) and Eq. Q. Thus, beyond this ‘cross-over 
point’ the quantum algorithm is expected to run faster 
than any classical linear-system solving algorithm. As¬ 
suming an algorithmic accuracy e = 0.01, gate counts and 
circuit depth of order 10^® or 10^® are found, respectively, 
depending on whether we take the resource requirements 
for oracle implementations into account or not, while the 
numbers of qubits used simultaneously amount to 10® or 
340, respectively. These numbers are several orders of 
magnitude larger than we had initially expected accord¬ 
ing to the big-0 analyses in indicating that the 

constant factors (which are not included in the asymp¬ 
totic big-0 estimates) must be large. This indicates that 
more research is needed about whether asymptotic anal¬ 
ysis needs to be supplemented, particularly in comparing 
quantum to classical algorithms. 

To get an idea of our results’ implications, we note that 
the practicality of implementing a quantum algorithm 
can strongly be affected by the number of qubits and 
quantum gates required. For example, the algorithm’s 
run-time crucially depends on the circuit depth. With 
circuit depth on the order of 10^®, and with gate oper¬ 
ation times of 1 ns (as an example), the computation 
would take approx. 3 x 10® years. And such large re¬ 
source estimates arise for the solely logical part of the al¬ 
gorithm implementation, i.e., even assuming perfect gate 
performance and ignoring the additional physical over¬ 
head costs (associated with QEC/QC to achieve fault- 
tolerance and specifics of quantum computer technology). 
In practice, the full physical resource estimates typically 
will be even larger by several orders of magnitude. 

One of the main purposes of this paper is to demon¬ 
strate how the impressively large LRE numbers arise 
and to explain the actual bottlenecks in the computa¬ 
tion. We find that the dominant resource-consuming 
part of QLSA is Hamiltonian Simulation and the ac¬ 
companying quantum-circuit implementations of the or¬ 
acle queries associated with Hamiltonian matrix A. In¬ 
deed, to be able to accurately implement each run of the 
Hamiltonian evolution as part of QPEA, one requires a 
large time-splitting factor of order 10^^ when utilizing the 
Suzuki-Hi^er-Order Integrator method including Trot- 
terization^, [ 2 ^, |2^. And each single time step involves 
numerous oracle queries for matrix A, where each query’s 
quantum-circuit implementation yields a further factor of 
several orders of magnitude for gate count. Hence, our 
LRE results suggest that the resource requirements of 
QLSA are to a large extent dominated by the numerous 
oracle A queries and their associated resource demands. 
Finally, our results also reveal lack of parallelism; the 
algorithmic structure of QLSA is such that most gates 
must be performed successively rather than in parallel. 
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Our LRE results are intended to serve as a baseline 
for research into the reduction of the logical resource 
requirements of QLSA. Indeed, we anticipate that our 
estimates may prove to be conservative as more effi¬ 
cient quantum-computation techniques become available. 
However, these estimates indicate that, for QLSA to be¬ 
come practical (i.e., its implementation in real world to 
be viable for relevant problem sizes), a resource reduction 
by many orders of magnitude is necessary (as is, e.g., sug¬ 
gested by ^ 3 X 10® years for the optimistic estimate of 
the run-time given current knowledge). 

F. Outline of the paper 

This paper is organized as follows. In Sec. [H] we iden¬ 
tify the resources to be estimated and expand on our 
goals and techniques used. In Sec. Hill we describe the 
structure of QLSA and elaborate on its coarse-grained 
profiling with respect to resources it consumes. Sec. ITVl 
demonstrates our quantum implementation of oracles 
and the corresponding automated resource estimation us¬ 
ing our quantum programming language Quipper (and 
compiler). Our LRE results are presented in Sec. IVl and 
further reviewed in Sec. ED We conclude with a brief 
summary and discussion in Sec. I VIII 

II. RESOURCE ESTIMATION 

As mentioned previously, the main goal of this work 
is to find concrete logical resource estimates of QLSA as 
accurately as possible, for a problem size for which the 
quantum algorithm and the best known classical linear- 
system solving algorithm are expected to require a similar 
run-time order of magnitude, and beyond which the for¬ 
mer provides an exponential speedup over the latter. An 
approximation for this specific ‘cross-over point’ problem 
size can be derived by comparing the coarse run-time big- 
O estimates of the classical and quantum algorithms, pro¬ 
vided respectively by Eqs. © and ®, assuming the same 
algorithmic computation precision e, and the same k and 
d valuefl For instance, choosing the accuracy e = 0.01 
and presuming d ~ 10, yields the approximate value 
Amoroso ~ 4 X 10^ for the cross-over point. The specified ex¬ 
ample problem that has been subject to our LRE analysis 
has the somewhat larger size N = 332, 020,680, while the 
other relevant parameters have the values k = 10'^, d = 7, 
and e = 10“^. 

Logical resources to be tracked are the overall number 
of qubits (whereby we track data qubits and ancilla qubits 
separately), circuit width (i.e., the max. number of qubits 
in use at a time, which also corresponds to the max. num¬ 
ber of ‘wires’ in algorithm’s circuit), circuit depth (i.e.. 


^ Note that the run-time big-O estimates of the classical [Eq. 0] 
and the quantum [Eq. 0] algorithm both scale linearly with k. 


the total number of logical steps specifying the length of 
the longest path through the algorithm’s circuit assuming 
maximum parallelism), the number of elementary (1- and 
2-qubit) gate operations (thereby tracking the quantity 
of each particular type of gate operation), and ‘T-depth’ 
(i.e., the total number of logical steps containing at least 
one T-gate operation, meaning the total number of T- 
gate operations that cannot be performed in parallel but 
must be implemented successively in series). While we 
are not considering the costs of QEC in this paper, it is 
nevertheless important to know that, when QEC is con¬ 
sidered, the T gate, as a non-transversal gate, has a much 
higher per-gate resource cost than the transversal gates 
X, Y, Z, H, S, and CNOT, and thus contributes more to 
algorithm resources relative to the latter. It is for this 
reason that we call out the T-depth separately. 

Note that the analysis in this paper involves only the 
abstract algorithmic-level logical resources; i.e., we ignore 
all additional costs that must be taken into account when 
implementing the algorithm on a fault-tolerant real-world 
quantum computer, namely, resources associated with 
techniques to avoid, mitigate or correct errors which oc¬ 
cur due to decoherence and noise. More specifically, here 
we omit the overhead resource costs associated with var¬ 
ious QC and QEC strategies. We furthermore assume no 
connectivity constraints, thus ignoring resources needed 
to establish fault-tolerant quantum communication chan¬ 
nels between two distant (physically remotely located) 
qubits which need to interact in order to implement a 
two-qubit gate such as a CNOT in the course of the al¬ 
gorithm implementation. Besides being an indispensable 
first step towards a complete resource analysis of any 
quantum algorithm, focusing on the algorithmic-level re¬ 
sources allows setting a lower limit on resource demands 
which is independent of the details of QEC approaches 
and physical implementations, such as qubit technology. 

To be able to represent large circuits and determine 
estimates of their resource requirements, we take advan¬ 
tage of repetitive patterns and the hierarchical nature 
of circuit decomposition down to elementary quantum 
gates and its associated coarse-grained profiling of logical 
resources. For example, we generate ‘templates’ repre¬ 
senting circuit blocks that are reused frequently, again 
and again. These templates capture both the quantum 
circuits of the corresponding algorithmic building-blocks 
(subroutines or multi-qubit gates) and their associated 
resource counts. As an example, is is useful to have a 
template for Quantum Fourier Transform (or its inverse) 
acting on n qubits; for other templates, see Figl2] and 
appendix El The cost of a subroutine may thereby be 
measured in terms of the number of specified gates, data 
qubits, ancilla uses, etc., or/and in addition in terms of 
calls of lower-level sub-subroutines and their associated 
costs. Furthermore, the cost may vary depending on in¬ 
put argument value to the subroutine. Many of the in¬ 
termediate steps represent multi-qubit gates that are fre¬ 
quently used within the overall circuit. Such intermediate 
representations can therefore also improve the efficiency 
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of data representation. Accordingly, each higher-level cir¬ 
cuit block is decomposed in a hierarchical fashion, in a 
series of steps, down to elementary gates from the stan¬ 
dard set {X, Y, Z, H, S, T, CNOT}, using the decomposi¬ 
tion rules for circuit templates (see appendices [T] and [2] 
for details). 

Indeed, QLSA works with many repetitive patterns 
of quantum circuits involving numerous iterative opera¬ 
tions, repeated a large number of times. Repetitive pat¬ 
terns arise from the well-established techniques such as 
Quantum Phase Estimation, Quantum Amplitude Esti¬ 
mation, and Hamiltonian Simulation based on Suzuki- 
Higher-Order Integrator decomposition and Trotteriza- 
tion. These techniques involve large iterative factors, 
thus contributing many orders of magnitude to resource 
requirements, in particular to the circuit depth. Indeed, 
these large iterative factors explain why we get such large 
gate counts and circuit depth. 

It is useful to differentiate between the resources as¬ 
sociated with the ‘bare algorithm’ excluding oracle im¬ 
plementations and those which also include the imple¬ 
mentation of oracles. In order to perform the LRE, we 
chose an approach which combines manual analysis for 
the bare algorithm ignoring the cost of oracle implemen¬ 
tations (see Sec. inii) with automated resource estimates 
for oracles generated via the Quipper programming lan¬ 
guage and compiler (see Sec. El) . Whereas a manual LRE 
analysis was feasible for the bare algorithm thus allow¬ 
ing a better understanding of its structural ‘profiling’ as 
well as checking the reliability of the automated resource 
counts, it was not feasible (or too cumbersome) for the 
oracle implementations. Hence, an automated LRE was 
inevitable for the latter. The Quipper programming lan¬ 
guage is thereby demonstrated as a universal automated 
resource estimation tool. 


III. QUANTUM LINEAR SYSTEM 
ALGORITHM AND ITS PROFILING 

A. General remarks 

QLSA computes the solution of a system of linear equa¬ 
tions, Ax = b, where A is a Hermitian N x N matrix 
over C and x, b S C'^. For this purpose, the (classi¬ 
cal) linear system is converted into the corresponding 
quantum-theoretic analogue, A\x) = |6), where |a;), |6) 
are vectors in a Hilbert space T-L = (C^)®" corresponding 
to n = [log 2 N~\ qubits and A is a Hermitian operator 
on T-L. Note that, if A is not Hermitian, we can define 
A := (^t g ), b := (b,0)^, and x := (0,x)^, and restate 
the problem as Ax = b with a Hermitian 2N x 2N matrix 
A and x, b G C^'^. 

The basic idea of QLSA has been outlined in the Intro¬ 
duction. In what follows, we illustrate the structure of 
QLSA including the recently proposed generalization Q 
in more detail. In particular, we expand on its coarse¬ 
grained profiling with respect to resources it consumes. 


Our focus in this section is the implementation of the 
bare algorithm, which accounts for oracles only in terms 
of the number of times they are queried. The actual 
quantum-circuit implementation of oracles is presented 
in Sec. m Our overall LRE results are summarized in 
Sec. |V1 

B. Problem specification 

We analyze a concrete example which was demon¬ 
strated as an important QLSA application of high prac¬ 
tical relevance in Q: the linear system Ax = b arising 
from solving Maxwell’s equations to determine the elec¬ 
tromagnetic scattering cross section of a specified target 
object via the Finite Element Method (FEM) [^ . Ap¬ 
plied in sciences and engineering as a numerical tech¬ 
nique for finding approximate solutions to boundary- 
value problems for differential equations, FEM often 
yields linear systems Ax = b with highly sparse ma¬ 
trices - a necessary condition for QLSA. The FEM ap¬ 
proach to solving Maxwell’s equations for scattering of 
electromagnetic waves off an object, as demonstrated in 
d, m, [ 25 , introduces a discretization by breaking up the 
computational domain into small volume elements and 
applying boundary conditions at neighboring elements. 
Using finite-element edge basis vectors [ 2 ^, the system 
of differential Maxwell’s equations is thereby transformed 
into a sparse linear system. The matrix A and vector b 
comprise information about the scattering object; they 
can be derived, and efficiently computed, from a func¬ 
tional that depends only on the discretization chosen and 
the boundary conditions which account for the scattering 
geometry. For details, see [H, and @ including its 
supplementary material. 

Within the scope of the PLATO project, we ana¬ 
lyzed a 2D toy-problem given by scattering of a lin¬ 
early polarized plane electromagnetic wave E{x,y) = 
i?opexp[z(A; • r — tot)], with magnitude Eq, frequency w, 
wave vector k — k^cosOe^ sinOey), and polarization 
unit vector p = x k/k, while r = xe^ -f yey is the 
position, off a metallic object with a 2-dimensional scat¬ 
tering geometry. The scattering region can have any ar¬ 
bitrary design. A simple square shape was specified for 
our example problem, whose edges are parallel (or per¬ 
pendicular) to the Cartesian x-y plane axes, and an inci¬ 
dent field propagating in x-direction (0 = 0) towards the 
square, as illustrated in Fig.[T] The receiver polarization, 
needed to calculate the far-held radar cross-section of the 
scattered waves, has been assumed to be parallel to the 
polarization of the incident held. 

For the sake of simplicity, for FEM analysis we used a 
two-dimensional uniform hnite-element mesh with square 
hnite elements. Note that QLSA requires the matrix el¬ 
ements to be efficiently computable, a constraint which 
restricts the class of FEM meshes that can be employed. 
As a result of the local nature of the hnite-element ex¬ 
pansion of the scattering problem, the corresponding lin¬ 
ear system has a highly sparse matrix A. For meshes 



FIG. 1. A 2D toy-problem: scattering of a linearly polarized 
plane electromagnetic wave off a metallic object with a 2- 
dimensional scattering geometry. A simple square was chosen 
for our example problem, with edges of length L = 2\ aligned 
with the Cartesian x-y plane axes, and an incident held with 
wavelength A and wavevector k = {2n/X)ex propagating to¬ 
wards the square. When interacting with the metallic object 
the electromagnetic wave scatters off into all directions. The 
task consists in computing the far-held radar cross-section us¬ 
ing the FEM approach to solve Maxwell’s equations. 

with rectangular finite elements, the maximum number 
of non-zero elements in each row of A (i.e., sparseness) 
is d = 7. Moreover, for regular grids, such as used for 
our analysis, we obtain a banded sparse matrix A, with a 
total oi Nb = 9 bands. 

The actual instructions for computing the elements 
of the linear system’s matrix A and vector b, as well 
as of the vector R whose overlap with the solution x 
is used to calculate the far-held radar cross-section (see 
Sec. IIII CI). a re specihed in our Quipper code for QLSA, 
see inniS. The metallic scattering region is thereby 
given in terms of an array of scattering nodes denoted 
as ‘scatteringnodes’. Here we briefly summarize the 
FEM dimensions and the values of all other system pa¬ 
rameters that are necessary to reproduce the analysis. 
For all other details, we refer the reader to our QLSA’s 
Quipper code and its documentation in 

The total number of FEM vertices in x and y dimen¬ 
sions were Ux = 12,885 and Uy = 12,885, respectively, 
yielding N = nx{ny — 1) -I- ny{nx — 1) = 332,020,680 
for the total number of FEM edges, which thus deter¬ 
mines the number of edge basis vectors, and hence also 
the size of the linear system, and in particular the size 
of the N X N matrix A. The lengths of FEM edges in 
X and y dimensions were lx = O.lm and ly = 0.1m, 
respectively. The analyzed 2D scattering object was a 
square with edge length L = 2A, which in our analysis 
was placed right in the center of the FEM grid. In our 
Quipper code for QLSA Bin it is represented by the 
array ‘scatteringnodes’ containing the corner vertices 
of the scattering region. The dimensions of the scatter¬ 
ing region can also be expressed in terms of the number 
of vertices in x and y directions; using A = Im, the scat- 
terer was given by a 200 x 200 square area of vertices. The 
incident and scattered field parameters were specified as 


follows. The incident field amplitude, wavenumber and 
angle of incidence were set Eq = 1.0 E/m, k = 27rm“^ 
(implying wavelength A = Im) and 0 = 0, respectively. 
The receiver (for scattered held detection) was assumed 
to have the same polarization direction as the incident 
held and located along the x-axis (at angle ^ = 0). The 
task of QLSA is to compute the far-held radar cross- 
section with a precision specihed in terms of the multi¬ 
plicative error bound e = 0.01. 

Finally, we remark that our example analysis does 
not include matrix preconditioning that was also pro¬ 
posed in to expand the number of problems that can 
achieve exponential speedup over classical linear-system 
algorithms. With no preconditioning, condition numbers 
of the linear-system matrices representing a hnite element 
discretization of a boundary value problem typically scale 
worse than poly-log(A^), which would be necessary to at¬ 
tain a quantum advantage over classical algorithms. In¬ 
deed, as was rigorously proven in [l^, [1^, FEM matrix 
condition numbers are generally bounded from above by 
0(Af^/") for n > 3 and by 0{N) for n = 2, with n 
the number of dimensions of the problem. For regular 
meshes, the bound is valid for all n > 2. In 

our 2D toy problem, n = 2 and the mesh is regular, im¬ 
plying that the condition number is bounded by 0{N). 
However, we used the much smaller value k = 10'^ from 
lARPA GFI to perform our LRE. This ‘guess’ can be 
motivated by an estimate for the lower bound of k that 
we obtained numerically^ 


The condition number of a matrix A is defined by K.p{A) = 
II A||p|| lip, where || • ||p denotes the matrix norm that is used to 

induce a metric. Hence, the condition number is also a function 
of the norm which is used. The 1-norm || • ||i and 2-norm 11-112 
commonly used to define the condition number, and obviously 
^1 ^ ^2 in general. But due to ||A||i/\/]V < ||A ||2 < \/iV||A||i 
for N X N matrices A, knowing the condition number for either 
of these two norms allows to bound the other. Furthermore, if A 
is normal (i.e. diagonalizable and has a spectral decomposition), 
then K .2 = |Amax|/|Amin|, where Amax and Amin are the maximum 
and minimum eigenvalues of A. For a regular mesh of size fi, ac 2 
generally scales as 0{h~^) [2^ . l28l .[^ . Hence, because the num¬ 
ber of degrees of freedom scales as N = k .2 is bounded 

by (see [ 2 ^ . 1 ^ for rigorous proof). In our toy problem, 

h ^ 0.1 whereas N ^ S x 10^, thus it is not evident whether a 
guess for K 2 should be based on 0{h~^) or O(A^), as the two 
bounds indeed differ by many orders of magnitude. Besides, as 
our LRE analysis aims at achieving an optimistic (as opposed 
to an overly conservative) resource count for QLSA, it is more 
sensible to use the lower bound rather than the upper bound 
as a guess for K 2 . Hence, we attempted to find an actual lower 
bound for K 2 numerically. To this end, because an estimate for 
Ki can be obtained with much less computational expense than 
for K 2 for a given matrix of a very large size, we used Matlab 
and extrapolation techniques to attain a rough approximation of 
Ki from the given code specifying the matrix of our toy problem. 
We found a value ki ^ 10^. This allowed us to infer a rough 
estimate for the lower bound for K 2 . Indeed, using the above 
relation between the matrix norms || • ||i and || • II 2 for a square 
matrix and realizing that both ||A||i and ||A ||2 have values of 
order 0(1), we may conclude that K 2 > Ki/y/N x 0(1), which is 
of order approximately 10^ — 10“^. 
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C. QLSA — abstract description 


The generalized QLSA Q is based on two well-known 
quantum algorithm techniques: Quantum Phase Es¬ 

timation Algorithm (QPEA) [a 0|i which uses Quan¬ 
tum Fourier Transform (QFT) [l| as well as Hamiltonian 
Simulation (HS) [ 8 | as quantum computational primi¬ 
tives, and (2) Quantum Amplitude Estimation Algorithm 
(QAEA) [22l |. which uses Grover’s search-algorithm prim- 
itive. The purpose of QPEA, as part of QLSA, is to 
gain information about the eigenvalues of the matrix A 
and move them into a quantum register. The purpose 
of the QAEA procedure is to avoid the use of nondeter- 
ministic (non-unitary) measurement and post-selection 
processes by estimating the quantum amplitudes of the 
desired parts of quantum states, which occur as superpo¬ 
sitions of a ‘good’ part and a ‘bad’ par 1 @. 

QLSA requires several quantum registers of various 
sizes, which depend on the problem size N and/or the 
precision e to which the solution is to be computed. We 
denote the j-th quantum register by Rj, its size by Uj, 
and the quantum state corresponding to register Rj by 
V’ is a label for the state). The following Ta¬ 
ble 11] lists all logical qubit registers that are employed by 
QLSA, specified by their size as well as purpose. The reg¬ 
ister size values chosen (provided in GFI within the scope 
of lARPA QCS program) correspond to the problem size 
N = 332, 020,680 and algorithm precision e = 0.01. 

For example, the choice uq = [log 2 M] = 14 for the 
size of the QAE control register can be explained as fol¬ 
lows. According to the error analysis of Theorem 12 
in using QAEA the modulus squared 0 < a < 1 
of a quantum amplitude can be estimated within ±ea of 
its correct valucru with a probability at least S/tt^ for 
fc = 1 and with a probability greater than 1 — 
for k > 2, if the QAE control register’s Hilbert space 
dimension M is chosen such that (see 


, _ , 2fc7rWa(l — a) fc^Tr^ 

la - a| < -- - + -TTTr < ea , 


M 


M2 


(4) 


where a (0 < a < 1) denotes the output of QAEA. More¬ 
over, if a = 0 then a = 0 with certainty, and if a = 1 
and M is even, then d = 1 with certainty. Corollary ® 
can be viewed as a requirement used to determine the 
necessary value of M, yielding (for a 0) 


M > 


— -j= (\/l — a -I- \/l — a -I- e) 
eya 


(5) 


The RHS of this expression is strictly decreasing, tending 
to as a becomes close to 1 , whereas for a 1 we 

y/eOL ’ 


^ Let ly?) = I'bgood) + It^junk) be a superposition of the good and the 
junk components of a (normalized) quantum state li/i). The goal 
of QAEA is to estimate a := (bgoodl bgood), he. the modulus 
squared of the amplitude of the desired good component. 

Note that we hereby use a multiplicative error bound to represent 
the desired precision of QAEA’s computation. 


Qubit Register 

Size 

Purpose 

Ro 

no = [logj M] = 14 

QAE control register 

Ri 

«-! = riog2 T] = 24 

HS control register 

R 2 , Ro 

W2 = riog2(2A)] = 30 

quantum data register 

Ri, Rs 

n4 = ne = 65 

computational register 

Re..., Rio 

1 

single-qubit ancilla 

Rii 

«-! = riog2 T] = 24 

auxiliary register 
for Integerinverse 

Ri 2 

W2 = riog2(2A)] = 30 

auxiliary register 

for HS subroutines 


TABLE I. QLSA logical qubit registers specified by their size 
and purpose. The parameters M and T characterize the preci¬ 
sion of the QAE and QPE procedure, respectively. According 
to the error analysis in [^, choosing M = )1 ensures 

that the modulus squared a of a quantum amplitude can be 
estimated by QAEA with a probability greater than 1 — e 
within ±ea of its correct value, with t specifying the desired 
precision, which in our analysis is chosen to be 0.01. Registers 
i ?2 and Ri are used for storing and processing the quantum 
data such as |6), \x) and |i?). Computational registers R 4 and 
i ?5 are used to hold signed integer values, where the last bit is 
the sign bit, with the convention that 0 stands for a positive 
number and 1 for a negative number, respectively. Several 
single-qubit auxiliary (ancilla) registers Re , Rio are em¬ 
ployed throughout the algorithm. In addition, an ni-qubit 
ancilla register Rn is needed to store the inverse values 
of the eigenvalues of matrix A, and a further n 2 -qubit ancilla 
register R 12 must be employed as part of HS subroutines. 


have M > r^[(l - f) + (1 - ^)]1 = \^^. Hence, 
we take M > \, so as to account for all possibilities. 
Moreover, we want QAEA to succeed with a probabil¬ 
ity close to 1 , allowing failure only with a small error 
probability p„r- According to Theorem 12 in [ 2 ^, this 
indeed can be achieved when 1 — 2 {k-i) ^ 1 ~ Perr, be., 
for k> ri -I- and thus for 

' ^perr ' 


M > 




( 6 ) 


While we may assume any value for the failure proba¬ 
bility, for the sake of simplicity we here choose perr = e, 
which is also the desired precision of QLSA. Unless a is 
very small, this justifies our choice M = 2 r'°S 2 (i/<= )1. A 
similar requirement for the value of M was also proposed 
in the supplementary material of In our example 

computation, e = 0.01, and so we have uq = 14. Note 
that small a values require an even larger value for the 
QAE control register size in order to ensure that the es¬ 
timate a is within iLea of the actual correct value with 
a success probability greater than 1 — e. 

As a first step, QLSA prepares the known quantum 
state 16)2 = \j )2 a multi-qubit quantum data 

register R 2 consisting of 712 = |"log2(2A^)] qubits. This 
step requires numerous queries (see details below) of an 
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oracle for vector b. Moreover, as pointed out in S, ef¬ 
ficient quantum state preparation of arbitrary states is 
in general not always possible. However, the procedure 
proposed in @ can efficiently generate the state 


The Hamiltonian quantum state evolution is accom¬ 
plished by multi-querying an oracle for matrix A and HS 
techniques Q, which particularly include the decompo¬ 
sition of the Hamiltonian matrix into a sum 


I^t)2.6 = cos((/)b) 



|0)g -b sin(06) |&)2 O |1)6 


(7) 


where the multi-qubit data register R 2 contains (as a 
quantum superposition) the desired arbitrary state \b) 
entangled with a 1 in an auxilliary single-qubit register 


Re, as well as a garbage state 


b) 


(denoted by the tilde) 


entangled with a 0 in register Re- To generate the state 
0 , in addition to data registers R 2 and single-qubit aux¬ 
illiary register Re, two further, computational registers 
i ?4 and Re are employed, each consisting of auxiliary 
qubits. The latter registers are used to store the magni¬ 
tude and phase components, which in are denoted as 
bj and (pj, respectively, that are computed each time the 
oracle b is queried. Which component (j = 1,2,3,...) 
to query is thereby controlled by data register R 2 . The 
quantum circuit for state preparation [Eq. (I7|)] is shown 
in Sec. IHID31 Fig. m Following the oracle b queries, a 
controlled phase gate is applied to the auxilliary single¬ 
qubit register Re, controlled by the calculated value of 
the phase carried by quantum register Re; in addition, 
the single-qubit register Re is rotated conditioned on 
the calculated value of the amplitude carried by quan¬ 
tum register i? 4 . Uncomputing registers R 4 and Re in- 
vlolves further oracle b calls, leaving registers R 2 and 

Re in the state 0 with sin^ 4>b = 

cos^ (j)b = ^ (1 - where Cb = l/ma.x{bj), 

cf. @. 


As a second step, QPEA is employed to acquire in¬ 
formation about the eigenvalues Xj of A and store them 
in a multi-qubit control register Ri consisting of ni = 
[log 2 T] qubits, where the parameter T characterizes 
the precision of the QPEA subroutine and is speci¬ 
fied in Table HI This high-level step consists of several 
hierarchy levels of lower-level subroutines decomposing 
it down to a fine-grained structure involving only ele¬ 
mentary gates. More specifically, controlled Hamilto¬ 
nian evolution X]t=o^ d'^) ('^Di ® [exp(—iATto/T ')]2 ® Ig 
with A as the Hamiltonian is applied to quantum state 
® \bT) 2 e- Here, similar to the presentation in Q, 
a time constant tg such that t = rto/T < to has been 
introduced for the purpose of minimizing the error for 
a given condition number k and matrix norm |1A||. As 
shown in Q, for the QPEA to be accurate up to error 
0(e), we must have to ^ O^k/c) if ||A|| ^ 0(1). Ac¬ 
cordingly, we define to '■= ||A||K/e. The apphcation of 
exp(—iArtg/T) on the data register R 2 is thereby con- 
trofled by ni-qubit controf register i?i prepared in state 
|0)i = 77®”^ |0)®”^ = -^Er^o 1^)1 (with H denoting 
the Hadammard gate). Controlled Hamiltonian evolu¬ 
tion is subsequently followed by a QFT of register Ri to 
complete QPEA. 


71 = (8) 

i=i 

of sub-matrices, each of which ought to be 1 -sparse, as 
well as the Suzuki higher-order Integrator method and 
Trotterization [^. In the general case, an arbitrary 
sparse matrix A with sparseness d can be decomposed 
into m = 6 d^ 1 -sparse matrices Aj using the graph¬ 
coloring method, see Q. However, a much simpler de¬ 
composition is possible for the toy-problem example con¬ 
sidered in this work. Indeed, a uniform finite-element 
grid has been used to analyze the problem specified in 
the GEL For uniform finite-element grids the matrix A 
is banded; furthermore, the number and location of the 
bands is given by the geometry of the scattering problem. 
Hence, to decompose the Hamiltonian matrix [Eq. (|5])], 
the simplest way do so is to break it up by band into 
m = Nb sub-matrices, with Aj denoting the j-th non¬ 
zero band of matrix A, and Nb denoting the overall num¬ 
ber of its bands. For the square hnite-element grid used 
in the analyzed example, Nb = 9. Moreover, because 
the locations of the bands are known, this decomposition 
method requires only time of order 0(1). Having the ma¬ 
trix decomposition ([5]) , it is then necessary to implement 
the application of each individual one-sparse Hamilto¬ 
nian from this decomposition to the actual quantum state 
of the data register i? 2 . This ‘Hamiltonian circuit’ can 
be derived by a procedure resembling the techniques of 
quantum-random-walk algorithm and is discussed in 

more detail in Sec. IHID 51 

After QPEA has been acomplished including the QFT 
of register Ri, the joined quantum state of registers i?i, 
i ?2 and Re becomes, approximately. 


N 


i=i 


® 1^1)2 ® 1^)5 


-b sin(((>6)/3j |Aj)^ O \uj)^ O |l)g ) , (9) 


lere Xj and \uj) are the eigenvalues and eigenvectors 
A, respectively, and [ 6)2 = 'n!j=iPj \'^j )2 “ 

^=1 Pj \'^o )2 bhe expansions of quantum states [ 6)2 
5^ , respectively, in terms of these eigenvectors, and 


• = \ /n / 9? 


As a third step, a further single-qubit ancilla in register 
i ?7 is employed, initially prepared in state [ 0)7 and then 
rotated by an angle inversely proportional to the value 











11 


stored in register i?i, yielding the overall state: 


N 

l^)l,2,6,7 “ 'y / ^ COS((/){,)/3j ® 1 ^ 1)2 ® I®) 


i=l 


sH<j3b)l3j Aj) (g) \uj)^ (g) |l)e 




C 


'1 -tt 10)7+ 1^11)7 1 > (10) 




where C := l//c is chosen such that C/Xj < 1 for all j, 
because of K = Amax/Amin- 

Finally, the eigenvalues stored in register i?i are un¬ 
computed, by the inverse QFT of i?i, inverse Hamilto¬ 
nian evolution on R 2 and on i?i, leaving registers 

i?i, i? 2 , -Rs; and Ry in the state 


N 

1 ^) 1 , 2 , 6 ,7 | 0 )l ® X! ( COs{4>b)i3j \Uj)^ (g) |0)g 

+ sm{(j)b)Pj luj)^ (g) | 1)6 


With respect to the particular application example 
that has been analyzed here, namely, solving Maxwell’s 
equations for a scattering problem using the FEM tech¬ 
nique, we are interested in the radar scattering cross- 
section (RCS) CTrcsj which can be expressed in terms 
of the modulus squared of a scalar product, CTrcs = 

|R • xp, where x is the solution of Ak = b and R is an 
-/V-dim vector whose components are computed by a 2D 
surface integral involving the corresponding edge basis 
vectors and the radar polarization, as outlined in detail 
in Thus, to obtain the cross section using QLSA, 
we must compute | (R|a:) p, where |R) is the quantum- 
theoretic representation of the classical vector R. It is 
important to note that, whereas |R) and |a;) are normal¬ 
ized to 1, the vectors R and x are in general not normal¬ 
ized and carry units. Hence, after computing | (R|a:) p, 
units must be restored to obtain |R • xp. 

As for \b), the preparation of the quantum state |R) 
is imperfect. Employing the same preparation procedure 
that has been used to prepare |&t), but with oracle R 
instead of oracle b, we can prepare the entangled state 


®(y(^|0), + £|i),) . (11) 

Ignoring register Ri and collecting all terms that are not 
entangled with the term 11)5 0 11)7 into a ‘garbage state’ 
1 ^ 0)2 6 71 common quantum state of registers R 2 , Re, 
and Ry can be written as, see 

1^)2,6.7 = (1 - sin^{</>6) 1^0)2,6.7 

-f sin((/){,)sin(02;) |a;)2 (g) |l)g g) 11)7 , ( 12 ) 

where 



is the normalized solution to A jx) = \b) stored in register 
R 2 and sin^ := l/^iP/A^- Note that the solu¬ 

tion vector [Eq. dT^ ] in register R 2 is correlated with the 
value 1 in the auxiliary register Ay. Hence, if register Ry 
is measured and the value 1 is found, we know with cer¬ 
tainty that the desired solution of the problem is stored 
in the quantum amplitudes of the quantum state of regis¬ 
ter i? 2 , which can then either be revealed by an ensemble 
measurement (a statistical process requiring the whole 
procedure to be run many times) or useful information 
can also be obtained by computing its overlap |(R|x)| 
with a particular (known) state |R) (corresponding to 
a specific vector R G C^) that has been prepared in a 
separate quantum register. To avoid non-unitary post¬ 
selection processes, CJS-QLSA @ employs QAEAF^ 


Note that l/Amax ^ Ks'mcpx ^ 1/^min’ which suggests that 


|Rt) 3,8 = cos((/)r) R^^(g)| 0 ) 8 -hsin(())^) |R) 3 (g)|l)g , (14) 


where the multi-qubit quantum data register R 3 consist¬ 
ing of 713 = riog 2 ( 2 A^)] qubits contains (as a quantum 
superposition) the desired arbitrary state |R) entangled 
with value 1 in an auxilliary single-qubit register i?g, as 


well as a garbage state 


R 


(denoted by the tilde) en¬ 


tangled with value 0 in register Rg. Moreover, the am¬ 
plitudes squared are given as sin^ 4 >r = ^ 
and cos^ cj)^ = ^ (l “ where Cn = 

l/max(Rj), cf. [5|. As outlined in the state (fTdl) is 
adjoined to state ([T^ along with a further ancilla qubit 
in single-qubit register Rg that has been initialized to 
state |0)g. Then, a Hadamard gate is applied to the an¬ 
cilla qubit in register Rg and a controlled swap operation 
is performed between registers Rg and R 3 controlled on 
the value of the ancilla qubit in register Rg , which finally 
is followed by a second Hadamard transformation of the 
ancilla qubit in register Rg. After a few simple classical 
transformations, the algorithm can compute the scalar 
product between |a;) and |R) as, cf. 


|(R|x)p 


Riiio ~ Run 
sin^ (f>b sin^ (j)x sin^ (jjr ’ 


(15) 


where Rmo and Run denote the probability of measur¬ 
ing a ‘1’ in the three ancilla registers Re, Ry and Rg 
and a ‘0’ or ‘1’ in ancilla register Rg, respectively. Fi¬ 
nally, after restoring the units to the normilized output 


M ~ 0{Kle) would be sufficient to estimate := sin'^ (px with 
multiplicative error e, see corollary ©. This is a conservative 
estimate, and the implied associated cost for the QAE step is 
indeed by a factor 0(«:) higher than that assumed by CJS in 
deriving the overall complexity [Eq. ([3jl]. 
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of QLSA, the RCS in terms of quantities received from D. QLSA — algorithm profiling and 

the quantum computation is, cf. quantum-circuit implementation 


'^RCS 


1 sin^ (j)h 

4^ ClC^ sin^ 


(sin^ (j)ro — sin^ 


0rl) : 


(16) 


1 1 

where sin(/)ro := ^iiio sini^r and sint/f^i := Pim smcj)^. 

It is important to note that, because all the employed 
state-preparation and linear-system-solving operations 
are unitary, the four amplitudes sin (j)b, sin (j)^, sin (j)ro and 
sin (j)ri that are needed for the computation of the RCS 
according to Eq. (1161) can be estimated nearly determin¬ 
istically (with error e) using QAEA which allows to avoid 
nested non-deterministic subroutines involving postselec- 
tionI3 Yet, there is a small probability of failure, which 
means that QLSA can occasionally output an estimate 
(Trcs that is not within the desired precision range of the 
actual correct value ctrcs- The failure probability is gen¬ 
erally always nonzero but can be made negligibleF^ 


However, it ought to be noted that, by ‘principle of deferred 
measurements’ (see 0). for any quantum circuit involving mea¬ 
surements whose results are used to conditionally control subse¬ 
quent quantum circuits, the actual measurements can always be 
deferred to the very end of the entire quantum algorithm, with¬ 
out in any way affecting the probability distribution of its final 
outcomes. In other words, measuring qubits commutes with con¬ 
ditioning on their postselected outcomes. Hence, any quantum 
circuit involving postselection can always be included as a sub¬ 
routine using only pure states as part of a bigger algorithm with 
probabilistic outcomes. Nonetheless, in view of the resources 
used to achieve efficient simulation, measuring qubits as early as 
possible can potentially reduce the maximum number of simulta¬ 
neously employed physical qubit systems enabling the algorithm 
to be run on a smaller quantum computer. In addition, we here 
emphasize that, with a small amount of additional effort, QAEA 
can be designed such that its final measurement outcomes nearly 
deterministically yield the desired estimates. Note that a simi¬ 
lar concept also applies to QAA in HHL-QLSA, which aims at 
amplifying the success probability. 

The RCS in Eq. l|16|l is of the form ctrcs = C^(ck 3 — 04 ), 
where C is a constant and ai {i = 1,...,4) are the moduli 
squared of four different quantum amplitudes to be estimated 
using QAEA. The QAE control register size no has been chosen 
such (see Table [1} that, with a success probability greater than 
1—e, respectively, the corresponding estimates are within ibeai of 
the actual correct values, i.e. = cxi dzeoi. It is straightforward 
to show that, with only a single run of each of the four QAEA 
subroutines, our estimate ctrcs = (as— 014 ) for RCS satisfies 

dRcs = crRCsdiecrRcsdie(jRcsdie(jRcs + 0(€^), and hence I^rcs — 
•^Rcsl ^ 3ecrRcs5 with a probability at least (1—e)'^ 1—4e. Note 

that, to ensure |d-Rcs ~ ctrcsI ^ ^ctrcs with a probability close 
to 1, we actually should have chosen an even higher calculation 
accuracy for each of the four QAEA subroutines, achieved by 
using the larger QAE control register size Uq = |'log 2 Af'], where 
M' = )] ^ enabling estimations with the smaller error 

e' := e/4. However, we avoided these details in our LRE analysis, 
which aims at estimating the optimistic resource requirements 
that are necessary (not imperatively sufficient) to achieve the 
calculation accuracy e = 0.01 for the whole algorithm. 


The high-level structure of QLSA is captured by 
a tree diagram depicted in Fig. O It consists of several 
high-level subroutines hierarchically comprising (i) ‘Am¬ 
plitude Estimation’ (first level), (ii) ‘State Preparation’ 
and ‘Solve for x’ (second level), (hi) ‘Hamiltonian Sim¬ 
ulation’ (third level), and several further sub-level sub¬ 
routines, such as ‘HSimKernel’ and ‘Hmag’ that are used 
as part of HS. Fig. [2] illustrates the coarse-grained profil¬ 
ing of QLSA for the purpose of an accurate LRE of the 
algorithm, demonstrating the use of repetitive patterns, 
i.e., templates representing algorithmic building blocks 
that are reused frequently. Representing each algorith¬ 
mic building block in terms of a quantum circuit thus 
yields a step-by-step hierarchical circuit decomposition of 
the whole algorithm down to elementary quantum gates 
and measurements. The cost of each algorithmic build¬ 
ing block is thereby measured in terms of the number of 
calls of lower-level subroutines or directly in terms of the 
number of specified elementary gates, data qubits, ancilla 
uses, etc. 

To obtain an accurate LRE of QLSA, we thus need to 
represent each algorithmic building block in terms of a 
quantum circuit that then enables us to count elemen¬ 
tary resources. In what follows, we present quantum cir¬ 
cuits for selected subroutines of QLSA. Well-known cir¬ 
cuit decompositions of common multi-qubit gates (such 
as, e.g., Toffoli gate, multicontrolled NOTs, and W gate) 
and their associated resource requirements are discussed 
in the Appendix. 


1. The ‘main’ function QLSAjmain 

The task of the main algorithm ‘QLSA_main’ is to esti¬ 
mate the radar cross section for a FEM scattering prob¬ 
lem specified in GFI using the quantum amplitude es¬ 
timation sub-algorithms ‘AmpEst_(/){,’, ‘AmpEst_(/>a;’ and 
‘AmpEst_^r’ to approximately compute the angles corre¬ 
sponding to the probability amplitudes sin(^h), sin(^a;), 
sin((()ro) and sin((()r,i): 

fib- ^ AmpEst_0;,(Oracle_b) 

4>x- ‘‘r- AmpEst_0a,(Oracle_A, Oracle_b) 

frO- ^ AmpEst_(/)r(Oracle_A, Oracle_b, Oracle_R, 0) 

fri- ^ AmpEst_(/)r(Oracle_A, Oracle_b, Oracle_R, I) 

where in the last two lines ‘0’ and ‘I’ refer to the prob¬ 
ability of measuring value 0 or 1 on ancilla qubit in 
register Rg, respectively. It then uses these probabil¬ 
ity amplitudes (or rather their corresponding probabil¬ 
ities) to calculate an estimate of the radar cross sec¬ 
tion ctrcs = cr^cs{fb,fx,fr0,4>ri) according to Eq. ((H]), 
whereby this part uses only classical computation. The 
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FIG. 2. (Color online) Coarse-grained QLSA profiling overview. The high-level structure of QLSA consists of several high-level 
subroutines (represented as black-framed boxes) hierarchically comprising (i) ‘Amplitude Estimation’ (first level), (ii) ‘State 
Preparation’ and ‘Solve for x’ (second level), and (iii) ‘Hamiltonian Simulation’ (third level), which includes several further 
sub-level subroutines, such as ‘HSimKernel’ and ‘Hmag’. These subroutines are further ‘partitioned’ into more fine-grained 
repetitive algorithmic building-blocks (such as, e.g. QFT, oracle query implementations, multi-controlled NOTs and multi- 
controlled rotations, etc.) that are eventually hierarchically decomposed down to elementary quantum gates and measurements. 
Among them, well-known library functions, such as QFT, are shown as green-framed boxes; single-qubit measurements (in 
computational basis) and well-established composite gates and multi-qubit controlled gates (such as Toffoli, W gate and multi- 
controlled NOTs) are represented by purple-framed boxes; automated implementations of oracles and the ‘Integerinverse’ 
subroutine are illustrated as red-framed boxes. For multi-qubit gates, the number of qubits involved is indicated by a subscript 
or a prefix label; for example, a QFT acting on no qubits is represented as ‘QFT^^^’; a multi-controlled NOT employing n 2 
control quits is denoted as ‘n 2 -fold CNOT’. The number of calls of each algorithmic building-block is indicated by a labelled 
arrow. The cost of a subroutine is measured in terms of the number of specified gates, data qubits, ancilla uses, etc., or/and 
in terms of calls of lower-level sub-subroutines and their associated costs. Note that the cost may vary depending on input 
argument value to the subroutine. To obtain the LRE of the whole algorithm, multiply the number of calls of each lowest-level 
subroutine with its elementary resource requirement. The cost of the lowest-level subroutines and oracles is provided in the 
form of tables in the appendix. It also becomes apparent how the overall run-time of QLSA accrues through a series of nested 
loops consisting of numerous iterative steps that dominate the run-time and others whose contributions are insignificant and 
can be neglected. The dominant contributions to run-time are given by those paths within the tree diagram which include 
Hamiltonian Simulation as the most resource-demanding bottleneck, involving Trotterization with r ~ 10^^ time-splitting slices, 
with each Trotter slice involving iterating over each matrix band to implement the corresponding part of Hamiltonian state 
transformation, which (for each band) furthermore requires several oracle A implementations to compute the matrix elements. 
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FIG. 3. Circuit to implement subroutine ‘AmpEst_(^6’, which 
computes an estimate for angle (jib- The unitary transfor¬ 
mations Ub and Ug are explained in Figs. 14151 The amplitude 
estimation subroutine is completed by a QFT of the QAE con¬ 
trol register Ro (here represented by wires Igo), ■ •., Iflno-i)) 
and measuring it in the computational basis. The measure¬ 
ment result g = ((;[0], ... ,g[no — 1]) is recorded, y ■<— g, and 
used to compute the estimate (jib = (7rj//M), cf. [^ . 


result of the whole computation ought to be as pre¬ 
cise as specified by the multiplicative error term ihecrRcs! 
where the desired (given) accuracy parameter in our anal¬ 
ysis has the value e = 0.01. The LRE of the com¬ 
plete QLS algorithm is thus obtained as the sum of the 
LREs of the four calls of the quantum amplitude estima¬ 
tion sub-algorithms, respectively, that are employed by 
QLSA_main. 


2. Amplitude Estimation Subroutines 


In this subsection we present the quantum circuits 
of the three Amplitude Estimation subroutines ‘Am- 
pEst_^b’, ‘AmpEst_(()a;’ and ‘AmpEst_((),.’, which are 
called by ‘QLSA_main’ to compute estimates of the an¬ 
gles (j)b, 4>x, <prO and (j)ri that are needed to obtain an 
estimate for the RCS CTrcs- 

Subroutine AmpEst-(j)b — This subroutine computes 
an estimate for the angle (j)b, which determines the prob¬ 
ability amplitude of success sin(0f,) for the preparation 
of the quantum state |6) in register R 2 , see Eq. ([7]). Its 
algorithmic structure is represented by the circuits de¬ 
picted in Figs.[3][5l It employs subroutine ‘StatePrep.b’, 
which prepares the state [Eq. (jT])], and a Grover Iterator 
whose construction is illustrated by the circuit in Fig. [S] 


bo) — 


— bo) — 


bi) — 


— bi) — 

> 

b2) — 


— b2) — 

s 

bs) ^ 

Vb 

^ ba) — 

q 

ii 

x„(.,) — 


- x„(.,> — 

£ 

|6> — 


- \b) - 



FIG. 4. Unitary transformation Ub is an abbreviation for 
subroutine ‘StatePrep_b’, whose circuit representation is dis¬ 
cussed in Subsec. imp 31 


9/) - 

ho) — 
hi) — 
h2) — 
hs) — 
h",-i) — 
| 6 ) — 
|a)-_ 



FIG. 5. Quantum circuit of the (unitary) Grover iterator 
Ug employed by subroutine AmpEst_(()6; its action is to be 
controlled by control-register qubit g[j]. 


Subroutine AmpEsEefix — This subroutine computes 
an estimate for the angle (fix, which, together with the 
previously computed angle (j)b, determines the probabil¬ 
ity amplitude of success, sin((()b) sin((^ 2 ;), of computing 
the solution state ja;) in register i? 2 , see Eq. (HU. Its algo¬ 
rithmic structure is represented by the circuits depicted 
in Figs.inilSl It involves subroutine ‘StatePrep_b’, which 
prepares the quantum state 0, the subroutine ‘Solve_a::’, 
which implements the actual ‘solve-for-x’ procedure that 
incorporates all required lower-level subroutines such as 
those needed for Hamiltonian Simulation, and a Grover 
Iterator whose construction is given in Fig. [8l 

Subroutine AmpEstjjir — This subroutine computes 
an estimate for the angle (j>ro or (pri, respectively, which, 
together with the previously computed angles (jib and 4>x j 
determine the probability amplitude of sucessfuly com¬ 
puting the overlap integral {R\x). Its algorithmic struc¬ 
ture is represented by the circuits depicted in Figs. IMT^ 
It involves subroutines ‘StatePrep_b’ and ‘StatePrep_R’, 
which prepare the quantum states © and da, respec¬ 
tively, the subroutine ‘Solve_a;’, which implements the 
actual ‘solve-for-a;’ procedure, and furthermore a swapp 
protocol that is required for computing an estimate of 
{R\x), and finally a Grover Iterator whose construction 
is given by the quantum circuit in Fig. 1121 
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FIG. 6. Circuit to implement subroutine ‘AmpEst.c/ij;’, which 
computes an estimate for angle (jix- The unitary transforma¬ 
tions Ubx and Vg are explained in Figs. [7]l8l The amplitude 
estimation subroutine is completed by a QFT of the QAE con¬ 
trol register Ro (here represented by wires Igo) ,. •., |<?no-i)) 
and measuring it in the computational basis. The measure¬ 
ment outcome g = ((;[0],... ,g[no — 1]) is recorded, y t— g, 
and used to compute the estimate = {'KylM), cf. [^ . 



FIG. 7. Unitary transformation Ubx consists of two subrou¬ 
tines: ‘StatePrep_b’ followed by ‘Solve_ 2 ;’, whose circuit repre¬ 
sentations are discussed in Subsec. Till D 3l and Subsec. Till D 41 

3. State Preparation subroutine 

The state preparation subroutine ‘StatePrep’ is used to 
generate the quantum states |6t) and \Rt) in Eqs. ([7]) 
and m from given classical vectors b and R using 
the corresponding oracles and controlled phase and ro¬ 
tation gates. The circuit for generating Ibr) is depicted 
in Fig. [T31 A similar circuit is used to generate \Rt), by 
replacing the Oracle b by Oracle R. The subroutines ‘C- 
Phase’ and ‘C-RotY’ and their associated resource counts 
are discussed in appendix 1 2 g| and 1 2 hi respectively. The 
implementation of Oracles b and R is analyzed in Sec. lIVI 



FIG. 8. Quantum circuit of the (unitary) Grover iterator 
Vg employed by subroutine ‘AmpEst_(()a,’; its action is to be 
controlled by control-register qubit g[j]. 



FIG. 9. Quantum circuit to implement subroutine ‘Am- 
pEst_(()r’, which computes an estimate for the angle (j)ro or 
(j>ri, respectively, which, together with the previously com¬ 
puted angles 4 >b and (fix, are needed to calculate an estimate 
of RCS according to Eq. IIHI). The unitary transformations 
Ur and Qg are explained in Figs. I10I12I The amplitude esti¬ 
mation subroutine is completed by a QFT of the QAE control 
register Ro (represented by wires Igo) ,. •., |gno-i)) and mea¬ 
suring it in the computational basis. The measurement result 
g = (g[0],..., g[no — 1]) is recorded, g t— g, and used to com¬ 
pute the estimate (prf = {iTy/M), cf. [S^], depending on the 
value of the flag / G {0,1} used by unitary Qg, see Fig. [TT] 
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FIG. 10. Unitary transformation Ur is an abbreviation for the 
subroutine ‘Solve_a:r’, whose circuit representation is provided 
in Fig. [TTl 



FIG. 11. Definition of subroutine ‘Solve_a:r’ that is shown in 
Fig. [To] to define the unitary transformation Ur- This sub¬ 
routine starts with implementing the preparation of quan¬ 
tum states 0 and m in registers R 2 , Re and R 3 , Rs, (here 
given as |2;o),..., ia;n 2 -i> , !&) and \yo) n2-l) ! |r)), re¬ 

spectively; then it employs subroutine ‘Solve_a:’, which im¬ 
plements the actual “solve-for-af’ procedure; and finally, a 
Hadamard gate is applied to the ancilla qubit in register Rq 
(here labeled as |c)) and a controlled swap protocol is per¬ 
formed between registers R 2 and R 3 controlled on the value 
of the ancilla qubit in register Rq, which finally is followed by 
a second Hadamard gate on the ancilla qubit in register Rq. 
The swap protocol is required for computing an estimate of 
the overlap (i?|a;). 



FIG. 12. Quantum circuit of the (unitary) Grover iterator 
Qg employed by subroutine ‘AmpEst_(()r’; its action is to be 
controlled by control-register qubit g[j\. The value of the flag 
/ £ {0,1} determines whether the angle 4 >r 0 or (f)ri is to be 
estimated, respectively. 



FIG. 13. Quantum circuit to implement the subrou¬ 
tine ‘StatePrep(x, g; Oracle b, l/6niax)’, which generates the 
quantum state |&t )2 g in Ed- 0- In addition to the data 
register R 2 (function argument x; here represented by wires 
a:[0],..., 2 ;[n 2 — 1]) and single-qubit ancilla register Rq (here 
represented by wire q), the the procedure involves two fur¬ 
ther, auxiliary computational registers Ri and Re, each 
consisting of n 4 ancilla qubits (here represented by wires 
m[0],..., m[n 4 — 1] and p[0],... ,p[n 4 — 1]), respectively. The 
latter two registers m and p are used to store the magni¬ 
tude and phase components, bj and (pj, respectively. Fol¬ 
lowing the Oracle b queries, a controlled phase gate is ap¬ 
plied to the auxilliary single-qubit register q, controlled by 
the calculated value of the phase carried by n 4 -qubit ancilla 
register p; in addition, the single-qubit register q is rotated 
conditioned on the calculated value of the amplitude (mag¬ 
nitude) carried by the n 4 -qubit ancilla register m. Uncom¬ 
puting registers m and p invlolves further oracle b calls. The 
subroutine ‘StatePrep(y, r; Oracle r, l/i?max)’ generating the 
quantum state \Rt) is implemented by a similar circuit, with 
Oracle r instead of Oracle b. 
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J^. SolvejK subroutine 

Subroutine ‘Solve_a;(x, s; Oracle_A)’ is the actual 
linear-system-solving procedure, i.e., it implements the 
‘solve-for-a;’) transformation. More concretely, it takes 
as input the state | 6 t )2 e ®) been 

prepared in registers R2, Rq, and computes the state 
given in Eq. (USD which contains the solution state \x )2 = 
A~^ 1^)2 in register R 2 with success probability amplitude 
sin((/);,) sin((?!)a;). The arguments of this subroutine are x 
and s corresponding to the input states in data regis¬ 
ter i ?2 and single-qubit ancilla register i? 7 ; furthermore, 
Oracle_A occurs in the argument list to indicate that it 
is called by Solve.a; to implement the HS lower-level sub¬ 
routines. Note that ‘Solve-x’ does not act on register i? 6 . 

The quantum circuit for ‘Solve-x’ is shown in Fig. 1141 
It involves lower-level subroutines ‘HamiltonianSimula- 
tion’, QFT, ‘Integerinverse’, and their Hermitian conju¬ 
gates, respectively, and the controlled rotation ‘C-RotY’, 
which is defined and analyzed in appendix 1 2 hi 


5. Hamiltonian Simulation subroutines 

HS subroutines implement, as part of QPEA, the uni¬ 
tary transformation exp(—iArto/T), which is to be ap¬ 
plied to register R 2 , which together with register Rq has 
been prepared in quantum state | 6 t )2 6 ) whereby this 
Hamiltonian evolution is to be controlled by HS control 
register Ri and the Hamiltonian is specified by Oracle A. 

For a thorough HS analysis, see Q and further ref¬ 
erences therein. The decomposition of the banded 
Hamiltonian matrix A by band into a sum of sub¬ 
matrices, according to Eg. (1^ . and the Suzuki higher- 
order Integrator method [26j | with order k = 2 and 
Trotterization are all accomplished by subrou¬ 

tine ‘HamiltonianSimulation(x, t; OracleH)’, whose im¬ 
plementation is illustrated in Figs. ITBlfTTl The Suzuki- 
Trotter time-splitting factor, here denoted by r, can be 
determined by the following formula, cf. Q: 

r= ^ (I 7 ) 

where t = rto/T < to is the length of time the Hamilto¬ 
nian evolution must be simulated, and ||H|| is the norm 
of the Hamiltonian matrix. As was shown in Q, to en¬ 
sure algorithmic accuracy up to error bound e for sub¬ 
algorithm ‘Solve_ 2 :’, we must have to ^ 0{K/e). In our 
analysis, the time constant for Hamiltonian simulation 
was set to = 7«;/e, as suggested by the problem speci¬ 
fication in the lARPA GFI. Inserting the values fc = 2, 
Nb = 9, € = 0.01 and ||A||t < 7 x 10® into Eq. (flTll yields 
the approximate value r < 8 x 10^^. However, to en¬ 
sure accuracy e not only for the Hamiltonian evolution 
simulation bnt also for each of the three Amplitude Es¬ 
timation subroutines that employ subalgorithm ‘Solve.a;’ 
in — 1) calls, respectively, see Fig. [51 we would 

typically require a much smaller target accuracy for the 


implementation of the Hamiltonian evolution. Assum¬ 
ing errors always adding up, an obvious choice would be 
e! = e/(2”°+^ — 1), which, when inserted into Eq. (1171) in 
place of e, yields r k. 6.35 x 10^^. This is a fairly conserva¬ 
tive and unnecessarily large estimate, though. Following 
the suggestions in the GFI, for the purpose of our LRE 
analysis, we have used the somewhat smaller (average) 
value r = 2.5 x 10^^, which is roughly obtained by using 
the average Hamiltonian evolution time to/2 rather than 
the maximum HS time to in Eq. (HZl. 

Furthermore, the application of a controlled one-sparse 
Hamiltonian transformation to any arbitrary input state 
in register R 2 uses techniques resembling a generaliza¬ 
tion of the quantum-random-walk algorithm [30l| . Its 
implementation is the task of the two lower-level sub¬ 
routines ‘HsimKernel(t, X, band, timestep, OracleA)’ and 
‘Hmag(x, y, m, which are represented and illus¬ 
trated by circuits in Figs. [T51 and [T51 respectively. 


6. Oracle subroutines 

A quantum oracle is commonly considered a unitary 
‘black box’ labeled as Uf which, given the value x of an 
n-qubit input register TZi, efficiently and unitarily com¬ 
putes the value of a function / : { 0 , 1 }" —>■ { 0 , 1 }™ and 
stores it in an m-qubit auxiliary register 72,2 that has 
initially been prepared in state | 0 )®’”: 

Uf : |x)i G 10)2 ^ |x)i G |/(x))2 . (18) 

In our analysis, oracles must be employed for the purpose 
of state preparation (Oracle b or Oracle R) and Hamilto¬ 
nian simulation (Oracle A); they need to be constructed 
from mappings between the FEM global edge indices and 
the quantities defining the linear system, matrix A and 
vector b, as well as the ‘measurement vector’ R that is 
used to compute the RCS. 

Theoretically, oracle implementations are usually not 
specified. The efficiency of oracular algorithms is com¬ 
monly characterized in terms of their query complexity, 
assuming each query is given by an efficiently computable 
function. However, in practice oracle implementations 
must be accounted for. Our analysis aims at comprising 
all resources, including those which are needed to imple¬ 
ment the required oracles. Their automated implemen¬ 
tation using the programming language Quipper and its 
compiler is elaborated on in Sec. lYl Here we briefly dis¬ 
cuss the high-level tasks of these oracle functions. Their 
resource estimates are presented in appendix [3] 

Oracle b is used to prepare quantum state | 5 t )2 6 ’ 

Eq. ([T]) and Fig. [T31 Its task is accomplished by subrou¬ 
tine ‘Oracle_b(x, m, p)’, which takes as input the quan¬ 
tum state of the n 2 -qubit register R 2 (argument x; span¬ 
ning the linear system global edge indices), computes the 
corresponding magnitude value bj and phase value (pj, 
and stores them in the two auxiliary computational reg¬ 
isters i ?4 and i ?5 (labeled by arguments m and p), each 
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FIG. 14. Quantum circuit to implement subroutine ‘Solve_x(x, s; Oracle^H)’. Register R 2 (here represented by wires labeled as 
Ixq) ,..., |a;n 2 -i)) carries the input state |&t )2 e defined in Eq. (O; the register Re is ignored here, as Solve_x does not act on 
the latter. The output state of Solve-X is stored in register R 2 ; it contains the solution |x )2 = Ib)^ with success-probability 
amplitude sm{(j)b) sm{(l>x) (see Eq. (I12II . Quantum register Ri (here represented by wires |fo) ,.. ., |^ni-i)) is the control register 
for the HS procedure, which is represented by the unitary transformation Uhs that is defined in Fig. [15] and elaborated on 
below. Uhs and its Hermitian conjugate U\jg act on register i? 2 , with the action being controlled by |t)^^ that has been 
initialized to state := |0)®"^. Following Uhs, QFT is performed on register Ri to complete the implementation 

of QPEA and so acquire information about the eigenvalues of A and store them in register Ri. A local auxiliary ni-qubit 
register i?ii is employed (here represented by wires |/o),..., |/ni-i)) that has been initialized to state |0)jj = |0)®"^. By 
subroutine ‘Integerlnverse’:|t)j^ ® —>■ |t)j^ 0 |l/t)]^j^, whose implementation is discussed in Sec. IIVI ancilla register Rn 

obtains the inverse value of the eigenvalue \j stored in HS control register R\. Next, the controlled rotation ‘C-RotY’ (see 
appendix 1 2 hi for details) rotates the quantum state of single-qubit register R 7 (here labeled as |s)) by an angle proportional 
to the value stored in register Rii, i.e. inversely proportional to the eigenvalue stored in register Ri; this step implements the 
transformation yielding the quantum state in Eq. dloj. Finally, registers Ri and Rn are uncomputed and terminated by the 
inverse operation of Integerinverse on Ri and Rii, inverse QFT of Ri, inverse Hamiltonian evolution of R 2 , applying on 

i?i and measuring the value ‘0’ in all corresponding qubits; this step yields the common quantum state m for registers i?i, 
R 2 , Re, and R 7 . 
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FIG. 15. Unitary transformation Uhs is an abbreviation for 
the subroutine ‘HamiltonianSimulation(x, t; OracleA)’, whose 
quantum-circuit implementation is given below. 


consisting of 714 ancilla qubits and initialized (and later 
terminated) to states | 0 )®”^, respectively. 

Oracle R is used to prepare quantum state IRt)^ g in 
Eq. (ITU) . Its task is accomplished by subroutine ‘Ora- 
cle_R(x, m, p)’ which takes as input the quantum state 
of the n 2 -qubit register Rg (argument x; spanning the 
FEM global edge indices), computes the corresponding 
magnitude value rj and phase value (pj , and stores them 
in the two n 4 -qubit auxiliary computational registers i ?4 
and i? 5 , (labeled by arguments m and p), each initialized 
(and later terminated) to states |0)®”'‘, respectively. 

Oracle A is needed to compute the matrix A of the lin¬ 
ear system; it is employed as part of the HS subroutine 
‘HsimKernel’ to specify the 1-sparse Hamiltonian that is 
to be applied. This high-level task is accomplished by 
subroutine ‘Oracle_A(x, y, z; band, argfiag)’, which takes 
as input the quantum state of the n 2 -qubit register R 2 
(argument x; spanning the linear system global edge in¬ 
dices) and returns the connected Hamiltonian node in- 
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repeat r times 

FIG. 16. Quantum circuit to implement subroutine 
‘HamiltonianSimulation(x, t; Oracle^)’ which uses HS control 
register R\ (function argument t; represented by wires labeled 
as |to) ,..., |fni-i)) to apply a Hamiltonian transformation 
of register R 2 (function argument x; represented by wires la¬ 
beled as |a:o) ,..., |2:n2-i))5 with the Hamiltonian specified by 
Oracle A. This subroutine comprises the Suzuki higher-order 
Integrator method with order k = 2 and Trotter time-splitting 
factor r; the value r = 2.5 x 10^^ has been used for our LRE. 
The unitary transformations Uz{ti) and Uz{t 2 ) are defined in 
Fig. El 


dex storing it in an n 2 -qubit ancilla register i?i 2 (labeled 
by argument y); furthermore, it accesses Hamiltonian 
bands through the integer argument ‘band’ and, depend¬ 
ing on the value of the integer varable argflag S {0,1}, 
computes the corresponding Hamiltonian magnitude or 
phase value, respectively, and stores it in the correspond¬ 
ing auxiliary n 4 -qubit register z S {m, p}. 


IV. AUTOMATED RESOURCE ANALYSIS OF 
ORACLES VIA THE PROGRAMMING 
LANGUAGE QUIPPER 

The logical circuits required to implement the Oracles 
A, 5, and R were generated using the quantum program¬ 
ming language Quipper and its compiler. Quipper is also 
equipped with a gate-count operation, which enables per¬ 
forming automated LRE of the oracle implementations. 

Our approach is briefly outlined as follows. Oracles A, 
b and R were provided to us in the lARPA QCS program 
GFI in terms of Matlab functions, which return matrix 
and vector elements defining the original linear-system 
problem. The task was to implement them as unitary 
quantum circuits. We used an approach that combines 
‘Template Haskell’ and the ‘classical-to-reversible’ func¬ 
tionality of Quipper, which are explained below. This 


approach offers a general and automated mechanism for 
converting classical Haskell functions into their corre¬ 
sponding reversible unitary quantum gates by automat¬ 
ically generating their inverse functions and using them 
to uncompute ancilla qubits. 

This Section starts with a short elementary introduc¬ 
tion to Quipper. We then proceed with demonstrating 
how Quipper allows automated quantum-circuit genera¬ 
tion and manipulation and indeed offers a universal au¬ 
tomated LRE tool. We finally discuss how Quipper’s 
powerful capabilies have been exploited for the purpose 
of this work, namely achieving automated LRE of the 
oracles’ circuit implementations. 

A. Quipper and the Circuit Model 

The programming language Quipper 0111 is a do¬ 
main specific, higher-order, functional language for quan¬ 
tum computation. A snippet of Quipper code is es¬ 
sentially the formal description of a circuit construc¬ 
tion. Being higher-order, it permits the manipulation 
of circuits as first-class citizens. Quipper is embedded 
in the host-language Haskell and builds upon the work 

of [niiiiiiiMisi. 

In Quipper, a circuit is given as a typed procedure 
with an input type and an output type. For example, 
the Hadamard and the not-gates are typed with 

hadcunard ; ; Qubit -> Circ Qubit 
qnot ;; Qubit -> Circ Qubit 

They input a qubit and output a qubit. The keyword 
Circ is of importance: it says that when executed, the 
function will construct a circuit (in this case, a trivial 
circuit with only one gate). 

Quantum data-types in Quipper are recursively gen¬ 
erated: Qubit is the type of quantum bits; (A,B) is a 
pair of an element of type A and an element of type B; 
(A,B,C) is a 3-tuple; () is the unit-type: the type of the 
empty tuple; [A] is a list of elements of type A. 

If a program has multiple inputs, we can either place 
them in a tuple or use the eurry notation (—>■). For in¬ 
stance, the program 

prog :: (A,B,C) -> Circ D 

takes three inputs of type A, B and C and outputs a result 
of type D, while at the same time producing a circuit. 
Using the curry notation, the same program can also be 
written as 

prog :: A -> B -> C -> Circ D 

where D is the type of the output. We use the program 
by placing the inputs on the right, in order: 

prog a b c 

The meaning is the following: prog a is a function of 
type B -> C -> Circ D, waiting for the rest of the ar¬ 
guments; prog a b is a function of type C -> Circ D, 
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where 

f ^rR^o^GrT) or , 

j = l,2,-,N^ l/=(l- 4 ( 3 )r„/( 2 r 7 ) "'‘“'^" (4 4 '^^) 


FIG. 17. Definition of the unitary transformation ‘[/ 2 (timestep)’ for two different time-steps timestep G {ti,t 2 }, which are 
determined by Suzuki-Integrator constant p2 and Trotter time-splitting factor r. Uz{ti) and Uz{t2) are used to implement the 
Suzuki higher-order Integrator as part of the task of the higher-level subroutine ‘HamiltonianSimulation(x, t; OracleTl)’, see 
Fig. 1161 The implementation of the lower-level subroutine ‘HsimKernel(t, x, band, timestep, Oracle^)’ is presented in Fig. 1181 
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FIG. 18. Quantum circuit to implement the subroutine ‘HsimKernel(t, x, band, timestep, Oracle^)’, whose task is to apply a 
1-sparse Hamiltonian to the input state in register R 2 (function argument x; here represented by wires *[0],... ,a;[n 2 — 1]), 
whereby the Hamiltonian transformation is to be controlled by HS control register Ri (function argument t; represented by wires 
t[0],..., t[ni — 1]) and Oracle A is used to specify the Hamiltonian. The argument ‘band’ is an integer to denote the Hamiltonian 
band that is to be applied. The argument ‘timestep’ is a real time scale factor, which can have two values timestep G {ti,t 2 } 
(see Fig. El). Oracle A is a function ‘Oracle_H(x, y, z; band, argfiag)’ that accesses Hamiltonian bands and, depending on the 
value of the integer flag argflag G {0,1}, computes the corresponding magnitude or phase value, respectively, and stores them 
in an n 4 -qubit register z G {m, p}. Here, y is an n 2 -qubit ancilla register R \2 to hold the connected Hamiltonian node index, 
and the auxiliary n 4 -qubit registers m and p are used to store the Hamiltonian magnitude and phase value, respectively. 
These ancilla registers are initialized and terminated to states and |0)®"^, respectively. The controlled subroutine 

M :=Hmag(x, y, m, 00) is defined in Fig. 1191 and the controlled subroutine ‘C-Phase(c; 0o,/)’ is discussed in appendix | 2 g| and 
illustrated in Fig. 1371 
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program generates a circuit, and (Qubit,Qubit) indi¬ 
cates that two quantum bits are going to be returned. 
Line 2 starts the actual coding of the program. Lines 3 
to 6 are the instructions generating new quantum bits 
and performing gate operations on them, while Line 7 
states that the newly created quantum bits ql and q2 
are returned to the user. 


Line numbers: 
Circuit: 


3 

Oh 


6 

- 0 - 


Oh 


H 


7 

ql 

q2 


FIG. 19. Quantum circuit to implement the subroutine 
M :=Hmag(x, y, m, (/iq), whose application is to be controlled 
by a single-qubit t[j] that is part of the ni-qubit HS con¬ 
trol register t. Its task is to apply the coupling elements’ 
magnitude component of a 1-sparse Hamiltonian operation; 
the circuit implementation resembles a generalized quantum 
walk. Here, x and y are n 2 -qubit index registers (represented 
by wires 2 ;[ 0 ],.. ., x\n 2 — 1] and j/[0],. .., y[n 2 — 1]), respec¬ 
tively, and m is an r!, 4 -qubit register (represented by wires 
m[0],..., m[n 4 — 1]), which holds the Hamiltonian magnitude 
value. The angle denotes the minimum resolvable phase 
shift. The W gate and the controlled phase gate P{2(j)ot) are 
specified in appendix [m and 1 2 g[ respectively. 

waiting for the last argument; finally, prog a b c is the 
fully applied program. If a program has no input, it has 
simply the type Circ B if B is the type of its output. 

Using the introduced notation, we can type the 
controlled-not gate: 

controlled_not : : 

Qubit -> Qubit -> Circ (Qubit,Qubit) 

and initialization and measure: 

qinit :: Bool -> Circ Qubit 
measure :: Qubit -> Circ Bit 

To illustrate explicitly how quantum circuits are gen¬ 
erated with Quipper, let us use a well-known example: 
the EPR-pair generation, defined by the transformation 
|0) (g) |0) —>■ l/-\/2 (|0) g) |0) + |1) (g) |1)). The Quipper 
code which creates such an EPR pair can be written as 
follows: 

1 epr :: Circ (Qubit,Qubit) 

2 epr = do 

3 ql <- qinit False 

4 q2 <- qinit False 

5 q2 <- hadamard q2 

6 controlled_not ql q2 

7 return (ql,q2) 

The generated circuit is presented in Pig. [501 and each 
line is shown with its corresponding action. Line 1 de¬ 
fines the type of the piece of code: Circ means that the 


FIG. 20. EPR-pair creation; circuit generated with Quipper. 

Quipper is a higher-order language, that is, functions 
can be inputs and outputs of other functions. This al¬ 
lows one to build quantum-specific circuit-manipulation 
operators. For example, 

controlled: (Circ A) -> Qubit -> Circ A 

inputs a circuit, a qubit, and output the same circuit 
controlled with the qubit. It fails at run-time if some 
non-controllable gates were used. So the following two 
lines are equivalent: 

controlled (qnot x) y 
controlled_not x y 

The function classical_tojreversible, presented in 
Section liVDi is another example of high-level operator. 

The last feature of Quipper useful for automated gen¬ 
eration of oracles is the subroutine (or box) feature. The 
operator box allows macros at the circuit level: it allows 
re-use of the same piece of code several times in the same 
circuit, without having to write down the list of gates 
each time. When a particular piece of circuit is used sev¬ 
eral times, it makes the representation of the circuit in 
the memory more compact, therefore more manageable, 
in particular for resource estimation. 


B. Quipper-Generated Resource Estimation 

The previous section showed how a program in Quip¬ 
per is essentially a description of a circuit. The execution 
of a given program will generate a circuit, and performing 
logical resource estimation is simply achieved by complet¬ 
ing the program with a gate-count operation at the end 
of the circuit-generation process. Instead of, say, send¬ 
ing the gates to a quantum co-processor, the program 
merely counts them out. Quipper comes equipped with 
this functionality. 
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C. Regular versus Reversible Computation 

An oracle in quantum computation is a description of a 
classical structure on which the algorithm acts: a graph, 
a matrix, etc. An oracle is then usually presented in the 
form of a regular, classical function / from n to m bits 
encoding the problem. It is left to the reader to make this 
function into the unitary of Fig. 1211 acting on quantum 
bits. 


X 

V 


Uf 

— — y + fix) 


FIG. 21. General form of the oracle for a function /. 


Provided that the function / is given as a procedure 
and not as a mere truth table, there is a known efficient 
strategy to build U / out of the description of / [s^. 

The strategy consists in two steps. First, construct the 
circuit 7/ of Fig. [22l Such a circuit can be built in a 
compositional manner as follows. Suppose that / is given 
in term of g and h\ f{x) = h{g(x)). Then, provided that 
Tg and Th are already built, T/ is the circuit in Fig. [23l 
NOT and AND are enough to write any boolean function 
/: these are the base cases of the construction. The gate 
Tnot is the controlled-not, and the gate Tand is the 
Toffoli gate. 

Once the circuit Tf \s built, the circuit Uf, shown in 
Fig.lMlis simply the composition of T/, a fanout, followed 
with the inverse ot Tf. At the end of the computation, 
all the ancillas are back to 0: they are not entangled 
anymore and can be discarded without jeopardizing the 
overall unitarity ot Uf. 


D. Quipper and Template Haskell 

As the transformation sending a procedure / to a cir¬ 
cuit Tf is compositional, it can be automated. We are 
using a feature of the host language Haskell to perform 
this transformation automatically: Template-Haskell. In 
a nutshell, it allows one to manipulate a piece of code 
within the language, produce a new piece of code and 
inject it in the program code. Another (slightly mislead¬ 
ing) way of saying it is that it is a type-safe method for 
macros. Regardless, it allows one to do exactly what 
we showed in the previous section: function composition 
is transformed into circuit composition, and every sub¬ 
function f : A —^ R is replaced with its corresponding 
circuit, whose typB is A — >■ Circ B: a function that 
inputs an object of type A, builds a (piece of) circuit, 
and outputs B. For example, the code 


Technically, the type is Circ(y4 —> Circ B). But this is only an 
artifact of the mechanical encoding. 


| 0 >- 


Tf 


garbage 

fix) 


FIG. 22. Gircuit Tf. Note that the middle set of inputs are 
ancilla qubits. 



> garbage 


h{g{x)) 


FIG. 23. Composing two oracles. 


|o>— 


T/ ^= 1 back to 
J state |0) 




z + fix) 


FIG. 24. Making an oracle reversible. 


my_cLnd :: (Bool,Bool,Bool) -> Bool 
my_and (x,y,z) = x && (y && z) 

computing the conjunction of the three input variables 
X, y and z is turned into a function 

template_my_cLnd : : 

(Qubit,Qubit,Qubit) -> Circ Qubit 

computing the circuit in Fig. [23 Notice how the in¬ 
put wires are not touched and how the result is just one 
among many output wires. One can as easily encode the 
addition using binary integer. 


♦ 




^ I 






0 |-( 

)—— 


FIG. 25. Circuit mechanically generated. 

As Quipper is a high-level language, it flawlessly allows 
circuit manipulation. In particular, one can perform the 
meta-operation classical_to_reversible sending the 
circuit Tf to Uf, of type 

(A -T Circ B) -T (A, B) -T Circ (A, B), 
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FIG. 26. Circuit made reversible. 

provided that A and B are essentially lists of qubits, and 
that Tf only consists of classical reversible gates: nots, 
c-nots, cc-nots, etc. 

In the case of our my_and function, it produces the 
circuit in Fig. [221 of the correct shape. One can easily 
check that the wire out is correctly set. 

E. Encoding Oracles 

The oracles of QLSA were given to us as a set of Mat- 
lab functions as part of the lARPA QCS program GFI. 
These functions computed the matrix A and the vectors b 
and i? of 0. They were not using any particular library: 
directly translating them into Haskell was a straightfor¬ 
ward operation. As the Matlab code came with a few 
tests to validate the implementation, by running them in 
Haskell we were able to validate our translation. 

The main difficulty was not to translate the Matlab 
code into Quipper, but rather to encode by hand the real 
arithmetic and analytic functions that were used. Fig. [27] 
shows a snippet of translated Haskell code: it is a non¬ 
trivial operation using trigonometric functions. Another 
part of the oracle is also using arctan. 

To be able to be processed through Template Haskell, 
all the arithmetic and analytic operations had to be writ¬ 
ten from scratch on integers encoded as lists of Bool. 
We used an encoding on fixed-point arithmetic. Inte¬ 
gers were coded as 32-bit plus one bit for the sign, and 
real numbers as 32-bit integer part and 32-bit mantissa, 
plus one bit for the sign. We could have chosen to use 
floating-point arithmetic, but the operations would have 

calcRweights y nx ny lx ly k theta phi = 
let (xc’.yc’) = edgetoxy y nx ny in 

let xc = (xc’-1.0)*lx - ((fromintegral nx)-1.0)*lx/2.0 in 
let yc = (yc’-1.0)*ly - ((fromintegral ny)-1.0)*ly/2.0 in 
let (xg,yg) = itoxy y nx ny in 
if (xg == nx) then 

let i = (mkPolar ly (k»xc*(cos phi)))*(mkPolar 1.0 (k*yc*(sin phi)))* 

((sine (k*ly*(sin phi)/2.0)) :+ 0.0) in 
let r = ( cos(phi) :+ k*lx )*((cos (theta - phi))/lx :+ 0.0) in i • r 
else if (xg==2*nx-l) then 

let i = (mkPolar ly (k*xc*cos(phi)))*(mkPolar 1.0 (k*yc*sin(phi)))* 

((sine (k*ly*sin(phi)/2.0)) :+ 0.0) in 
let r = ( cos(phi) :+ (- k*lx))*((cos (theta - phi))/lx :+ 0.0) in i • r 
else if ( (yg==l) ftft (xg<nx) ) then 

let i = (mkPolar lx (k*yc*sin(phi)))•(mkPolar 1.0 (k*xc*cos(phi)))* 

((sine (k*lx*(cos phi)/2.0)) :+ 0.0) in 
let r = ( (- sin phi) :+ k*ly )*((eos(theta - phi))/ly :+ 0.0) in i • r 
else if ( (yg==ny) ftfe (xg<nx) ) then 

let i = (mkPolar lx (k*ye*sin(phi)))*(mkPolar 1.0 (k*xc*cos(phi)))* 

((sine (k*lx*(cos phi)/2.0)) :+ 0.0) in 
let r = ( (- sin phi) :+ (- k*ly) )*((cos(theta - phi)/ly) :+ 0.0) in i • r 
else 0.0 :+ 0.0 


FIG. 27. Small piece of oracle R code. 


been much more involved: the corresponding generated 
circuit would have been even bigger. 

We made heavy use of the subroutine facility of Quip¬ 
per: All of the major operations are boxed, that is, ap¬ 
pear only once in the internal structure representing the 
circuit. This allows manageable processing (e.g. print¬ 
ing, or resource counting). As an example, the circuit for 
Oracle R of QLSA is shown in Fig. [28l 

F. Compactness of the generated oracles 

Our strategy for generating circuits with Template- 
Haskell is efficient in the following sense: the size of 
the generated quantum circuit is exactly the same as the 
number of steps in the classical program. For example, if 
the classical computation consists of n conjunctions and 
m negations, the generated quantum circuit consists of n 
Toffoli gates and m CNOT gates. 

The advantage of this technique is that it is fully gen¬ 
eral: with this procedure, any classical computation can 
be turned into an oracle in an efficient manner. 

Optimizing oracle sizes — As we show in this paper, 
the sizes of the generated oracles are quite impressive. In 
the current state of our investigations, we believe that, 
even with hand-coding, these numbers could only be im¬ 
proved upon by a factor of 5, or perhaps at most a factor 
of 10. We think that accomplishing a greater reduction 
beyond these moderate factors would require a drastic 
change in the generation approach and techniques. 

The reason why we think it is possible to achieve the 
mentioned moderate optimization is the following. Al¬ 
though the oracles we deal with in this work are speci¬ 
fied and tailored to the particular problem we have been 
analyzing, they are also general in the sense that they 
are made of smaller algorithms (e.g. adders, multipli¬ 
ers ...). The reversible versions of these algorithms have 
been studied for a long time, and quite efficient proposals 
have been made. An analysis of the involved resources 
shows that for the addition of n-bit integers, the number 
of gates involved in the automatically-generated adder 
gate Tf is < 25n and the number of ancillas is < 8n. A 
hand-made reversible adder can be constructed [s^ with 
respectively < 5n gates and < n ancillas. If one found a 
way to reuse these circuits in place of our automatically 
generated adders, it would reduce the oracle sizes. How¬ 
ever, it could only do so by a relatively small factor; the 
total number of gates would still be daunting. 

Despite this drawback, our method is versatile and able 
to provide circuits for any desired function / without 
further elaborate analysis. 

V. RESULTS 

Our LRE for QLSA for problem size N = 332, 020,680 
is summarized in Table im The following comments ex¬ 
plain this table and our assumptions. 
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FIG. 28. Oracle R, automatically generated. In the on-line version of the paper the reader can magnify the PDF image to see 
the details of the circuit. For display purposes, in this figure we use one wire for integers and two wires for real numbers. Only 
the main structure is shown: all operations such as tests, arithmetic and analytic operations only appear as named boxes. 


Resources 

inch oracles 

excl. oracles 

Max. overall number of qubits 

in use at a time 

3 X 10® 

341 

Max. number of data 

qubits at a time 

60 

60 

Max. number of ancilla 

qubits in use at a time 

3 X 10® 

281 

Overall number of ancilla 

generation-use-termination cycles 

2.8 X lO’^^ 

8.2 X lO’^^ 

Total number of gates 

2.37 X lO’^® 

3.34 X 10^® 

# H gates 

2.7 X 10^® 

1.20 X 10®® 

# S gates 

1.4 X lO’^® 

6.3 X 10®-* 

# T gates 

9.5 X 10^® 

1.29 X 10®® 

# X gates 

1.6 X lO’^® 

2.0 X 10®® 

# Z gates 

2 X 10^® 

2.4 X 10®® 

# CNOT gates 

8.5 X lO’^® 

1.7 X 10®"* 

Circuit Width 

3 X 10® 

341 

Circuit Depth 

1.8 X lO’^® 

3.30 X 10®® 

T-Depth 

8.2 X 10^® 

1.28 X 10®® 

Measurements 

2.8 X 10’^’’ 

8.23 X 10®^ 


TABLE II. QLSA resource requirements for problem size N = 
332, 020, 680 and algorithmic accuracy e = 0.01. 


Unlike with QEC protocols where the distinction be¬ 
tween ‘data qubits’ and ‘ancilla qubits’ is clear, here this 
distinction is somewhat ambiguous; indeed, all qubits in¬ 
volved in the algorithm are initially prepared in state |0), 
and some qubits that we called ancilla qubits exist from 
the start to the end of a full quantum computation part 
(such as e.g single-qubit registers Re, Rs). We regard 
qubits which carry the data of the linear system problem 
and store its solution at the end of the quantum compu¬ 
tation as data qubits; they constitute the quantum data 
registers i ?2 and R 3 , see Table HI All other qubits, in¬ 
cluding those of QAE and HS control registers Rq and 


Ri as well as of the computational registers R 4 and Re, 
are considered ancilla qubits. 

It is important to note that the overall QLS algo¬ 
rithm consists of four independent quantum computation 
parts, namely the four calls of ‘AmpEst’ subalgorithms, 
see Fig. [2 while the top-level function ‘ QLSA^main’ per¬ 
forms a classical calculation of the RCS (by Eq. (ITBll l 
using the results (/>&, (fx, <fr 0 , 4>ri of its four quantum 
computation parts. These four independent ‘AmpEst’ 
subalgorithms can either be performed in parallel or se¬ 
quentially, and the actual choice should be subject to any 
time/space tradeoff considerations. Here we assume a se¬ 
quential implementation, so that data and ancilla qubits 
can be reused by the four amplitude estimation parts. 
Hence, the qubit counts provided in Table HIl represent 
the maximum number of qubits in use at a time required 
by the most demanding of the four independent ‘Am¬ 
pEst’ subalgorithms. The maximum overall number of 
qubits (data and ancilla) in use at a time is also the def¬ 
inition for circuit width. While with a sequential imple¬ 
mentation we aim at minimizing the circuit width (space 
consumption), we can do so only at the cost of increasing 
the circuit depth (time consumption). The overall circuit 
depth is the sum of the depths of the four ‘AmpEst’ sub¬ 
algorithms. By a brief look at Fig. it is clear that the 
circuit depths are similarly large for ‘AmpEst_(/)a;’ and 
‘AmpEst-^r’ (where the latter is called twice), whereas 
compared to these the circuit depth of ‘AmpEst_())f,’ is 
negligible. Hence the overall circuit depth is rougly three 
times the circuit depth of subalgorithm ‘AmpEst_(/)r’. We 
could just as well assume a parallel implementation of the 
four ‘AmpEst’ calls. In this case the overall circuit depth 
would be by a factor 1/3 smaller than in the former case. 
However, this circuit-depth decrease can only be achieved 
at the cost of incuring a circuit-width increase. We would 
need up to four copies of the quantum registers listed in 
Table n and the required number of data and ancilla 
qubits in use at a time would be larger by a factor that 
is somewhat smaller than four. 

QLSA has numerous iterative operations (in particu- 
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lar due to Suzuki-Higher-Order Integrator method with 
Trotterization) involving ancilla qubit ‘generation-use- 
termination’ cycles, which are repeated, over and over 
again, while computation is performed on the same end- 
to-end data qubits. Table El provides an estimate for 
both the number of ancilla qubits employed at a time 
and for the overall number of ancilla generation-use- 
termination cycles executed during the implementation 
of all the four ‘AmpEst’ subalgorithms. To illustrate the 
difference we note that, for some quantum computer re¬ 
alizations, the physical information carriers (carrying the 
ancilla qubits) can be reused, for others however, such as 
photon-based quantum computer realizations, the infor¬ 
mation carriers are lost and have to be created anew. 

Furthermore, the gate counts actually mean the num¬ 
ber of elementary logical gate operations, independent of 
whether these operations are performed using the same 
physical resources (lasers, interaction region, etc.) or not. 
The huge number of measurements results from the vast 
overall number of ancilla-qubit uses; after each use an 
ancilla has to be uncomputed and eventually terminated 
to ensure reversibility of the circuit. Finally, Table |TT] 
distinguishes between the overall LRE that includes the 
oracle implementation and the LRE for the bare algo¬ 
rithm with oracle calls regarded as ‘for free’ (excluding 
their resource requirements). 


VI. DISCUSSION 

A. Understanding the resource demands 

Our LRE results shown in Table HIl suggest that the re¬ 
source requirements of QLSA are to a large extent domi¬ 
nated by the quantum-circuit implementation of the nu¬ 
merous oracle A queries and their associated resource 
demands. Indeed, accounting for oracle implementation 
costs yields resource counts which are by several orders of 
magnitude larger than those if oracle costs are excluded. 
While Oracle A queries have only slightly lower imple¬ 
mentation costs than Oracle b and Oracle R queries, it 
is the number of queries that makes a substantial differ¬ 
ence. As clearly illustrated in Fig. [5J Oracle A (required 
to implement the Hamiltonian transformation with 
t < to ^ 0{K/e)) is queried by many orders of magnitude 
more frequently than Oracles b and R, which are needed 
only for preparation of the quantum states |6) and |i?) 
corresponding to the column vectors b, R S C^. Hence, 
the overall LRE of the algorithm depends very strongly 
on the Oracle A implementation. However, note that Or¬ 
acles b and R contribute most to circuit width due to the 
vast number of ancilla qubits (~ 3 x 10®) they employ at 
a time, see Table E in appendix 131 

The LRE for the bare algorithm, i.e., with oracle 
queries and ‘Integerinverse’ function regarded as ‘for free’ 
(excluding their resource costs), amounts to the order of 
magnitude 10^® for gate count and circuit depth — still 
a surprisingly high number. In what follows, we explain 


how these large numbers arise, expanding on all the fac¬ 
tors in more detail that yield a significant contribution 
to resource demands. To do so, we make use of Fig. [31 

QLSA’s LRE is dominated by series of nested loops 
consisting of numerous iterative operations, see Fig. [2l 
The major iteration of circuits with similar resource de¬ 
mands occurs due to the Suzuki-Higher-Order Integra¬ 
tor method including a Trotterization with a large time¬ 
splitting factor of order 10^^ to accurately implement 
each run of the HS as part of QPEA. Indeed, each sin¬ 
gle call of ‘HamiltonianSimulation’ yields the iteration 
factor r = 2.5 x 10^^. This subroutine is called twice 
during the ‘Solve_x’ procedure, and the latter is further¬ 
more employed twice within the (controlled) Grover Iter¬ 
ators in three of the four QAEAs. There are “ 

2"o — I = 16383 controlled Grover Iterators employed 
within each of the four QAEAs. Hence, the ‘Hamiltoni¬ 
anSimulation’ subroutine is employed (2"° — I)x4x3 = 
196596 Ri 2 X 10® number of times altogether. Because 
each of its calls uses Trotterization with time-splitting 
factor 2.5 x 10®^ and a Suzuki-Higher-Order Integrator 
decomposition with order k = 2 involving a further ad¬ 
ditional factor 5, we already get the factor ~ 2.5 x 10®®. 
Moreover, the lowest-order Suzuki operator is a prod¬ 
uct of 2 X Vb = 18 one-sparse Hamiltonian propagator 
terms (where Nb = 9 is the number of bands in matrix 
A); each such term calls the ‘HsimKernel’ function, with 
‘band’ and ‘timestep’ as its runtime parameters. In addi¬ 
tion, each call of HsimKernel employs Oracle A six times 
and furthermore involves 24 applications of the proce¬ 
dure ‘Hmag’ controlled by the time register i?i. Thus, 
in total QLSA involves 6 x 18 x 2.5 x 10®® ~ 2.7 x 10^° 
Oracle A queries and 24 x 18 x 2.5 x 10®® ~ 10^® calls of 
controlled Hmag. Hence, even if subroutine Hmag con¬ 
sisted of a single gate and oracle A queries were for free, 
we would already have approx. 10^® for gate count and 
circuit depth. 

However, Hmag is a subalgorithm consisting of further 
subcircuits to implement the application of the magni¬ 
tude component of a particular one-sparse Hamiltonian 
term to an arbitrary state. It consists of several W 
gates, Toffolis and controlled rotations. Hence, a further 
increase of the order of magnitude is incurred by vari¬ 
ous decompositions of multi-controlled gates and/or rota¬ 
tion gates into the elementary set of fault-tolerant gates 
{H,S,T,X,Z,CNOT}, using the well-known decompo¬ 
sition rules outlined in appendix [31 (e.g., optimal-depth 
decompositions for Toffo li f3p a nd for controlled single¬ 
qubit rotations [H, HO, l4lL l42j|'). In our analysis, this 
yields a further factor ^10'®. Thus, even if we exclude 
oracle costs, we have 10^® x 10® = 10^® for gate count and 
circuit depth for the bare algorithm, simply because of a 
large number of iterative processes (due to Trotterization 
and Grover-iterate-based QAE) combined with decompo¬ 
sitions of higher-level circuits (such as multi-controlled 
NOTs) into elementary gates and single-qubit rotation 
decompositions (factors ^ 10^ — 10®). 

If we include the oracle implementation costs, the dom- 
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inant contribution to LRE is that of Oracle A calls, be¬ 
cause oracle A is queried by a factor ^ 10^^ more fre¬ 
quently than Oracle b and even by a larger factor than 
Oracle R. Each Oracle A query’s circuit implementation 
has a gate count and circuit depth of order ^ 2.5 x 10®, see 
appendix[3l Having approx. 2.7 x 10^° Oracle A queries, 
the LRE thus amounts to the order of magnitude ^ 10^®. 

Let us briefly summarize the nested loops of QLSA 
that dominate the resource demands, while other compu¬ 
tational components have negligible contributions. The 
dominant contributions result from those series of nested 
loops which include Hamiltonian Simulation as the most 
resource-demanding bottleneck. The outer loops in these 
series are the hrst-level QAEA subroutines to find esti¬ 
mates for (px, (j>rO and cpri, each involving 2”° — 1 = 16383 
controlled Grover iterators. Each Grover iterator in¬ 
volves several implementations of Hamiltonian simula¬ 
tion based on Suzuki higher-order Integrator decompo¬ 
sition and Trotterization with r « 10^^ time-splitting 
slices. Each Trotter slice involves iterating over each ma¬ 
trix band whereby the corresponding part of Hamilto¬ 
nian evolution is applied to the input state. Einally, for 
each band several oracle A implementations are required 
to compute the corresponding matrix elements, which 
moreover employs several arithmetic operations, each of 
which themselves require loops with computational effort 
scaling polynomially with the number of bits in precision. 


B. Comparison with previous ‘big-O’ estimations 

As pointed out in the Introduction, we provide the 
first concrete resource estimation for QLSA in contrast to 
the previous analyses ii which estimated the run-time 
of QLSA only in terms of its asymptotic behavior using 
the ‘big-O’ characterization. As the latter is supposed 
to give some hints on how the size of the circuit evolves 
with growing parameters, it is interesting to compare our 
concrete results for gate count and circuit depth with 
what one would expect according to the rough estimate 
suggested by the big-O (complexity) an^sis. The big-O 
estimations proposed by Harrow et al. Q and Glader et 
al. @ have been briefly discussed in the Introduction and 
are given in Eqs. (ED and ©, respectively. 

Gomplexity-wise, the parameters taken into account 
in the big-O estimations are the size N of the square 
matrix A, the condition number k of A, the sparseness d 
which is the number of non-zero entries per row/column 
in A, and the desired algorithmic accuracy given as error 
bound e. The choice of parameters made in this paper 
fixes these values to = 332 , 020 , 680 , k = 10 "^, d = 7, 
and e = 10“^. If one plugs them into Eqns. ED and ©, 
one gets respectively ^ 4 • 10 ^^ and ^ 2 • 10 ^^. 

Although these numbers are large, they are not even 
close to compare with our estimates. This is due to the 
way a big-O estimate is constructed: it only focuses on 
a certain set of parameters, the other ones being roughly 
independent of the chosen set. Indeed, the ‘function’ 


provided as big-O estimate is only giving a trend on how 
the estimated quantity behaves as the chosen set of pa¬ 
rameters goes to infinity (or to zero, in the case of e). 
Hence, only the limiting behavior of the estimate can be 
predicted with high accuracy, when the chosen relevant 
parameters it depends on tend towards particular values 
or infinity, while the estimate is very rough for other val¬ 
ues of these parameter. In particular, a big-O estimate 
is hiding a set of constant factors, which are unknown. 
In the case of QLSA, our LRE analysis does not reveal a 
trend, it only gives one point. Nonetheless, it shows that 
these factors are extremely large, and that they must be 
carefully analyzed and otherwise taken into account for 
any potentially practical use of the algorithm. 

Although the (unknown) constant factors implied by 
big-O complexity cannot be inferred from our LRE re¬ 
sults obtained for just a single problem size, we can 
nevertheless consider which steps in the algorithm are 
likely to contribute most to these factors. With our 
fine-grained approach we found that, if excluding the 
oracle A resources, the accrued circuit depth ~ 10^® is 
roughly equal to 3 x (2"“ — I) Grover iterations (as part 
of amplitude estimation loops for (prO and (pri) times 
4 X {2Nf)) X 5 X 2.5 x 10^^ for the number of exponen¬ 
tials needed to implement the Suzuki-Trotter expansion 
(as part of implementing HS, which is employed twice in 
Solve_a; that is again employed twice in each Grover itera¬ 
tor) times a factor ^ 2.4 x 10® coming about from the cir¬ 
cuits to implement, for each particular Aj in the decom¬ 
position [Eq. (jS])] , the corresponding part of Hamiltonian 
state transformation. In terms of GJS big-O complexity 
the circuit depth is O log(iV)/e^), which comes from 
O (1/e) QAE Grover iterationsI3 times O i^d'^n/e) expo¬ 
nential operator applications to implement the Suzuki- 
Trotter expansion PA times O(logiV) oracle A queries 
to simulate each query to any Aj in the decomposition 
[Eq. ([SD], times the overhead of 0{d^) computational 
steps including 0{d‘^) oracle A queries to estimating the 
preconditioner M of the linear system in order to pre¬ 
pare the preconditioned state M\b), see Here it is 
appropriate to note though that the HHL and CJS run¬ 
time complexities given in Eqs. o and ([2|), respectively, 
neglect more slowly-growing terms, as indicated by the 
tilde notation O(-). However, in a comparison with our 
empirical gate counts we ought to also take those slowly- 
growing terms into account. Eor instance, there is an¬ 
other factor of (Atd^/e^)^/"^ Ri 3 x 10^ contributing to the 


However, see our remarks in footnotes O and E] in which we 
pointed out that 0(«:/e) may be a more appropriate estimate for 
the complexity of the QAE loops. 

For a d-sparse A, simulating exp(zAf) with additive error e 
using HS techniques Q requires a runtime proportional to 
see ii- It is performing the phase 
estimation (as part of ‘Solve_fc’), which is the dominant source 
of error, that requires to take to = 0{K,/e) for the various times 
t = rtojT defining the HS control register in order to achieve a 
final error smaller than e, see 
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number of Suzuki-Trotter expansion slices, which was ig¬ 
nored in the O notation for HHL and CJS complexities, 
while it was accounted for in our LRE. By inspecting and 
comparing (CJS big-0 vs. our LRE) the orders of mag¬ 
nitude of the various contributing terms, we conclude 
that the big-0 complexity is roughly two orders of mag¬ 
nitude off (smaller) from our empirical counts for the 
Suzuki-Trotter expansion step. As for the QAE steps, 
our LRE count is ^ 5 x 10"^, which is roughly two or¬ 
ders of magnitude higher than 0(l/e) and smaller than 
0(«;/e), suggesting that 0(l/e) is too optimistic while 
O^n/e) is too conservative. Finally, the big-0 complex¬ 
ity misses roughly 5 orders of magnitude that our fine¬ 
grained approach reveals for the circuit-implementation 
of the Hamiltonian state transformation for each Aj at 
the lowest algorithmic level. 

In order to understand what caused such large constant 
factors, we estimated the resources needed to run QLSA 
for a smaller problem sizj^ while keeping the same preci- 
son (and therefore the same size for the registers holding 
the computed values). Specifically, we chose N = 24, 
while we kept the condition number and the error bound 
at the same values k = 10^ and e = 10“^, respectively. 
Despite the fact that the matrix A lost several orders 
of magnitude in size, the circuit width and depth ended 
up being of roughly the same order of magnitude as of 
Table in 

What our results suggest is that the large constant fac¬ 
tors arise as a consequence of the desired precision forcing 
us into choosing large sizes for the registers, whereas the 
LRE is not notably impacted by a change in problem size 
N. This can intuitively be understood as follows. First, 
the total number of gates required for QLSA’s non-oracle 
part scales as 0{\ogN), cf. Eq. (|3]); hence, using = 24 
in place of = 332, 020,680 suggests an LRE reduction 


A smaller problem size is obtained by reducing the spatial do- 
main size of the electromagnetic scattering FEM simulation, 
via reductions in parameters rix and Uy which represent the 
number of FEM vertices in x and y dimensions. The imme¬ 
diate consequence is a reduction of the common length of quan¬ 
tum data registers R2 and R3, i.e., n2 = |’log2(2A^)], where 
N = nx(ny — 1 ) -h [ux — Such register-length reduction 

is expected to affect the resource requirements for all oracles as 
well as all subroutines that involve the data registers R2 and 
Rz. In fact, the input registers to all oracles are of length n2, 
and shortening them has the potential of reducing the oracle 
sizes. However, we recounted oracles’ resources using Quipper, 
with 712 = 6 in place of n2 = 30 , and found that the only dif¬ 
ference involves the number of ancillas and measurements re¬ 
quired. When checking the resource change of the entire QLSA 
circuit, we found negligible difference. Indeed, changes in n2 
have a relatively little effect on resources of the bare algorithm 
(excluding oracle costs), because the dominant contribution to 
resources in the non-oracle part is given by the time-splitting 
factor imposed by Hamiltonian-evolution simulation, which does 
not directly depend on 712. Besides, since the total number of 
operations required for QLSA’s non-oracle part has a complex¬ 
ity that scales logarithmically in A, see Eqs. and m, the 
resources for n2 = 6 in place of n2 = 30 are expected to diminish 
by just a relatively small factor ~ 5 . 


only by a moderate factor ~ 5. Secondly, what mat¬ 
ters for the LRE of oracles is also mostly determined 
by the desired accuracy e. Each oracle query essentially 
computes a single (complex) value corresponding to a 
particular input from the set of all inputs. The oracles 
are oblivious to the problem size and to the actual value 
of each of their inputs. While oracles obtain actual in¬ 
put data from the data register i ?2 or i? 3 , whose size 
1^2 = «3 = log2(2V) clearly depends on N, these are not 
the ones that crucially determine the oracles’ sizes. What 
virtually matters for the size of the generated quantum 
circuit implementing an oracle query, is the size of the 
computational registers i ?4 and i ?5 used to compute and 
hold the output value of each particular oracle query. In 
our analysis, these registers have size n 4 = 65, cf. Table U 
they were kept at the same size when computing QLSA’s 
LRE for the smaller problem size N = 24. 

C. Lack of parallelism 

Comparing the estimates for the total number of gates 
and circuit depth reveals a distinct lack of parallelisrt^^ 
in the design of QLSA. As explained earlier, due to the 
highly repetitive structures of the algorithm primitives 
used, most of the gates have to be performed sequentially. 
Indeed, QLSA involves numerous iterative operations. 
The major iteration of circuits with similar resource re¬ 
quirements occurs due to the Suzuki-Higher-Order Inte¬ 
grator method that also involves Trotterization, which 
uses a large time-splitting factor of order 10^^ to accu¬ 
rately implement each run of the Hamiltonian-evolution 
simulation. In fact, the iteration factor imposed by Trot¬ 
terization of the Hamiltonian propagator is currently a 
hard bound on the overall circuit depth and even the to¬ 
tal LRE of QLSA, and it crucially depends on the aimed 
algorithmic precision e. The remarks in the following 
paragraph expand on this issue in more detail. 


D. Hamiltonian-evolution simulation as the actual 
bottleneck and recent advancements 

It is worth emphasizing that the quantum-circuit im¬ 
plementation of the Hamiltonian transformation us¬ 
ing well-established HS techniques Q constitutes the ac¬ 
tual bottleneck of QLSA. Indeed, this step implies the 
largest contribution to the overall circuit depth; it is 
given by the factor r x 5^“^ x {2Ni,), see Fig. [21 which is 
imposed by the Suzuki-Higher-Order Integrator method 


One can get a sense of the amount of parallelism of the overall 
circuit by comparing the total number of gates of an algorithm 
to its circuit depth. In our analysis, they only differ by a factor 
of ~ 1.33 if oracles are included, and by a factor of ~ 1.01 if 
oracles are excluded, thus most of the gates must be being applied 
sequentially. 
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together with Trotterization. According to Eq. (El) and 
the discussion following it, r ^ O . 

Thus, the key dependence of the time-splitting factor r is 
on the condition number k and the error bound e rather 
than on problem size N. The dependence on the latter 
enters only through the number of bands Nb (in the gen¬ 
eral case, the number m of submatrices in the decompo¬ 
sition [Eq. (HI)]), which can be small even for large matrix 
sizes, as is the case in our example. This feature explains 
why we can get similar LRE results for N = 332,020, 680 
and iV = 24 if K and e are kept at the same values for both 
cases and the number of bands Nb is small (see above). 

It is also important to note that there has been signif¬ 
icant recent progress on improving HS techniques. Berry 
et al. (4^ provide a method for simulating Hamiltonian 
evolution with complexity polynomial in log(l/e) (with 
e the allowable error). Even more recent works by Berry 
et al. [13, 113 improve upon results in [d^ providing a 
quantum algorithm for simulating the dynamics of sparse 
Hamiltonians with comple xity sublogarithmic in the in¬ 
verse error. Compared to [44l| . the analysis in [2^ yields 
a near-linear instead of superquadratic dependence on 
the sparsity d. Moreover, unlike th e ap proach (43l |. the 
query complexities derived in [4J, are shown to be 
independent of the number of qubits acted on. Most 
importantly, all three approaches [H, [d^, provide 
an exponential improvement upon the well-established 
method 0 that our analysis is based oiEI. To account 
for these recent achievements, we estimate the impact 
they may have with reference to the baseline imposed 
by our LRE results. The modular nature of our LRE 
approach allows us to do this estimation. The following 
back-of-the-envelope evaluation shows that, for e = 0.01, 
the advanced HS approaches [d^, [3 and [d^ may of¬ 
fer a potential reduction of circuit depth and overall gate 
count by orders of magnitude 10^, ^ 10^ and ^ 10®, 
respectively. 

Indeed, let us compare the scalings of the total num¬ 
ber of one-sparse Hamiltonian-evolution terms required 
to approximate to within error bound e = 0.01 for 
the prio r ap proach Q (used here) and the recent meth¬ 
ods [4^ [43. In doing so, we arrive at contrasting 


vs. 

or 
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for the three approaches 0, [d^ and [d^, respectively. 
In the first term, m denotes the number of submatrices in 
the decomposition [Eq. ([8])] ; in the general case, m = , 


The recently published advanced HS approaches [^. I44l . l4sll that 
promise more resource-efficient computation were not available 
at the time when our detailed implementation of QLSA and the 
corresponding LRE analysis (for which we used the previously 
published HS techniques i) were performed. 


in our toy-problem analysis, m = Nb- In the second and 
third term, d is the sparsity of A, and n is the number 
of qubits acted on, while c is a constant. In all three ex¬ 
pressions, II A|| is the spectral norm of the Hamiltonian A, 
which in our toy-problem example is time-independent. 
As stated in Sec. IHI D 51 for QLSA to be accurate within 
error bound e, we must have ||A||t ^ 0{K/e), cf. 0. Us¬ 
ing ||A||t < ||A||to = 7 X n/e and the parameter values 
m = IVb = 9, fc = 2, d = 7, n = 712 = 30 and c > 1, 
expression ED yields ^ 7 x IQi®, whereas the query 
complexity estimates (HOD and (1^ yield > 5 x 10^^ and 
^ 5 X 10®j_respectively. Hence, notably the advanced re¬ 
sults in [45l | imply that an improvement of our LRE by 
order of magnitude ^ 10® seems feasible. 


VII. CONCLUSION 

A key research topic of quantum computer science is 
to understand what computational resources would actu¬ 
ally be required to implement a given quantum algorithm 
on a realistic quantum computer, for the large problem 
sizes for which a quantum advantage would be attain¬ 
able. Traditional algorithm analyses based on big-0 com¬ 
plexity characterize algorithmic efficiency in terms of the 
asymptotic leading-order behavior and therefore do not 
provide a detailed accounting of the concrete resources re¬ 
quired for any given specific problem size, which however 
is critical to evaluating the practicality of implementing 
the algorithm on a quantum computer. In this paper, we 
have demonstrated an approach to how such a concrete 
resource estimation can be performed. 

We have provided a detailed estimate for the logical 
resource requirements of the Quantum Linear System 
algorithm, which under certain conditions solves a lin¬ 
ear system of equations. Ax = b, exponentially faster 
than the best known classical method. Our estimates 
correspond to the explicit example problem size beyond 
which the quantum linear system algorithm is expected 
to run faster than the best known classical linear-system 
solving algorithm. Our results have been obtained by a 
combination of manual analysis for the bare algorithm 
and automated resource estimates for oracles generated 
via the quantum programming language Quipper and its 
compiler. Our analysis shows that for a desired calcula¬ 
tion precision accuracy e = 0.01, an approximate circuit 
width 340 and circuit depth of order 10^® are required if 
oracle costs are excluded, and a circuit width and circuit 
depth of order 10® and lO^®, respectively, if the resource 
requirements of oracles are taken into account, showing 
that the latter are substantial. We stress once again that 
our estimates pertain only to the resource requirements 
of a single run of the complete algorithm, while actually 
multiple runs of the algorithm are necessary (followed by 
sampling) to produce a reliable accurate outcome. 

Our LRE results for QLSA are based on well- 
established quantum computation techniques and prim¬ 
itives P, d, 0, H, 113 as well as our approach to im- 
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plement oracles using Quipper. Hence, our estimates 
strongly rely on the efficiency of the applied methods 
and chosen approach. Improvement upon our estimates 
can only be achieved by advancements enabling more ef¬ 
ficient implementations of the utilized quantum compu¬ 
tation primitives and/or oracles. For example, as pointed 
out in Sec. ED most recent advancements of Hamiltonian- 
evolution simulation techniques suggest that a sub¬ 
stantial reduction of circuit depth and overall gate count 
by order of magnitude ^ 10® seems feasible. Like¬ 
wise, more sophisticated methods to generate quantum- 
circuit implementations of oracles more efficiently may 
become available. We think though that significant im¬ 
provements are going to come from inventing a better 
QLS algorithm, or more resource-efficient Hamiltonian- 
evolution simulation approaches, rather than from im¬ 
provements to Quipper. While we believe that our es¬ 
timates may prove to be conservative, they yet provide 
a well-founded ‘baseline ’ for research into the reduction 
of the algorithmic-level minimum resource requirements, 
showing that a reduction by many orders of magnitude 
is necessary for the algorithm to become practical. Our 
modular approach to analysis of extremely large quantum 
circuits reduces the cost of updating the analysis when 
improved quantum-computation techniques are discov¬ 
ered. 

To give an idea of how long the algorithm would have to 
run at a minimum, let us suppose that, in the ideal case, 
all logic gates take the same amount of time r, and have 
perfect performance thus eliminating the need for QC 
and/or QEC. Then for any assumed gate time r, one can 
calculate a lower limit on the amount of time required for 
the overall implementation of the algorithm. For exam¬ 
ple, if r = Ins (which is a rather optimistic assumption; 
for other gate duration assumptions, one can then plug 
in one’s own assumptions), a circuit depth of order 10^® 
(10^®) would correspond to a run-time approx. 3 x 10® 
(3 X 10^^) years, which apparently compares with or even 
exceeds the age of the Universe (estimated to be approx. 
13.8 X 10® years). Even with the mentioned promising 
improvements by a factor ^ 10® for the Hamiltonian- 
evolution simulation and by a factor ~ 10 for the oracle 
implementations, we would still deal with run-times ap¬ 
prox. 3 X 10® (3 X 10®) years. 

Although our results are surprising when compared to 
a naive anaWsis of the previous big-0 estimations of the 
algorithm Q, the difference can be explained by the 
factors hidden in the big-0 estimation analyses: we infer 
that these factors come for the most part from the large 
register sizes, chosen because of the desired precision. 

The moral of this analysis is that quantum algorithms 
are not typically designed with implementation in mind. 
Considering only the overal coarse complexity of a given 
algorithm does not make it automatically feasable. In 
particular, our analysis shows that book-keeping param¬ 
eters such as the size of registers have to be considered. 

Our analysis highlights an avenue for future research: 
quantum programming languages and formal methods. 


In computer science, mature techniques have been de¬ 
veloped for decades, and we ought to adapt and imple¬ 
ment them for a fine-grained analysis of quantum algo¬ 
rithms to pinpoint the various parameters in play and 
their relationships. In particular, these techniques may 
also allow to explicitly identify the actual bottlenecks of a 
particular implementation and provide useful insights on 
what to focus on for optimizations: in the case of QLSA, 
for instance, the Hamiltonian-evolution simulation and 
oracle implementations. Combining a fine-grained ap¬ 
proach with asymptotic big-0 analysis, a much fuller 
understanding of the bottlenecks in quantum algorithms 
emerges enabling focused research on improved algorith¬ 
mic techniques. 
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APPENDIX 

1. Single-qubit unitaries in terms of pre-specified 
elementary gates 

a. Implementation according to work by A. Fowler 

To convert any single-qubit unitary to a circuit in 
terms of a pre-specified set of gates {X, Y, Z, H, S,T}, we 
could use the famous Solovay-Kitaev algorithm, see, e.g., 
[D] and references therein. However, this work can result 
in unnecessarily long global phase correct approximat¬ 
ing sequences, since the trace-norm used in the Solovay- 
Kitaev theorem does not ignore global phases. Some op¬ 
timizations of the Solovay-Kitaev algorithm are possible, 
see e.g. [i^. For the single-qubit rotation gates, we base 
our estimates on work by A. Fowler (see [39|, p.l25 and 
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[40l|'). This work constructs optimal fault-tolerant ap¬ 
proximations of single-qubit phase rotation gates 





( 22 ) 


Fowler shows that a phase rotation by an angle of 7r/128 
can be approximated by a sequence of fault tolerant gates 
with a distance measure 


In other words, to obtain a distance 5 on average, we need 
on average I = gates. For S = 7.5x 10“"*, we 

obtain I = 50.69. Compare this to the exact result I = 46 
for Also, we note that 46 Gj gates correspond 

to 59 elementary gates, of which 23 are T gates. For 51 
Gj gates, we would get 26 T gates, 26 H gates and 14 S 
gates by extrapolation, for a total of 65 gates. 


b. Plato implementation of gate sequence approximations 


dist 128 , Fbe) 


m — 


"^^(-^ 1 / 128 ^ 46 ) 


< 0.01 


by choosing Uie as follows: 


7.5 X 10"^ 
(23) 


We have implemented a combination of Fowler’s 
method and the more recent single-qubit ‘normal form’ 
representation by Matsumoto and Amano (dll . 1^ in 
Haskell, to find approximating sequences. With this 
Haskell implementation, for example, we found an ap¬ 
proximating sequence for i? 7 r /256 with distance 6 = 3.6 x 
10“"^, and with sequence length 74: 


U46 = HTHTHT{SH)THT{SH)T{SH)T{SH)THT 
{SH)T{SH)THTHT{SH)T{SH)THT{SH)T 
{SH)T{SH)THT{SH)THT{HS'<)T (24) 

This sequence contains 23 H gates, 23 T (tt/S) gates and 
13 S or gates. In general, the approximating sequence 
is of the form GiTGjT, where Gi,Gj G G, a precom¬ 
puted set of gates, which together with the Identity gate I 
form a group under multiplication {/, Gi, G 2 ,..., G 23 }. 
Here, Gi = H, G 2 = X, G 3 = Z, G 4 = S, G 5 = 

Ge = XH, G 7 = ZH, Gs = SH, Gg = S'^H, Gio = ZX, 
Gii = SX, Gi 2 = S'1'X, Gi 3 = HS, Gi4 = 

Gi5 = ZXH, Gi6 = SXH, Gi7 = S'^XH, G 18 = HSH, 
Gig = HS^H, G 20 = HSX, G 21 = HS^X, G 22 = 
S'^HS, G 23 = SHS"^. To represent the complete set of 
approximating sequences, Fowler includes G 24 = T. 

The sequence given in Eq. ([24)l contains 46 Gj gates. 
The number of T gates is 23, or half the length of the ap¬ 
proximating sequence in terms of Gj gates. The number 
of H gates in this particular sequence is also 23, and the 
rest of the 59 elementary gates are S (or 5'!') gates. 

Fowler also investigated the approximation of arbitrary 
single qubit gates 

TT _ f cos( 0 / 2 )e*(“+^)/^ sin( 0 / 2 )e*^““^)/^\ , . 

y— sin( 0 / 2 )e**^“““''^^/^ cos( 0 / 2 )e®^““''^^/^y 

by sequences of gates from the group Q. 1000 random ma¬ 
trices were chosen, with a, (3 and 9 chosen uniformly in 
[0,27r). Optimal approximations Ui were constructed for 
each random matrix, and a line was fitted to the average 
distance dist{U, Ui) plotted for each 1. Fowler obtained 
the following fit for the average number I of single-qubit 
fault-tolerant gates required to obtain a fault-tolerant ap¬ 
proximation of an arbitrary single-qubit unitary to within 
the distance: 

(5 = dist(t7, Ui) = 0.292 x iQ-o osn-' . (26) 


Rn/256 ~ SHTHTHTSHTHTSHTHTSHTSHTSHT 
X HTHTHTSHTSHTSHTHTSHTHTSH 
xTHTSHTSHTSHTHTHTHTSHTHSS. 

This sequence consists of 28 (37.8%) T gates, 29 (39.2%) 
H gates, and 17 (23%) S gates. Smaller rotations tend 
to need longer sequences to reach the distance threshold 
S and/or improve on the identity as best approximation. 
Because our search algorithm used to find the approxi¬ 
mating sequences, like Fowler’s method, has exponential 
running time, finding a specific sequence to approximate 
a specific arbitrary rotation is not always feasible. Recent 
progress on this topic aiming at o ptim al-depth single¬ 
qubit rotation decompositions [U l48l . [Hfil . IHH 
highlights the importance of this problem for quantum 
computing. 

For our QLSA LRE we have made the following sim¬ 
ple (and rather pessimistic) assumption: namely, that 
any arbitrary single-qubit rotation gate (a large number 
of such gates, with various angles of rotation, occurs in 
the implementation of QLSA) can be approximated us¬ 
ing approx. 100 fault-tolerant gates from the standard set 
{X,Y, Z, H, S,T} while also achieving the desired level 
of algorithmic accuracy (e = 0.01). This approximation 
turned out to be indeed fairly conservative for all rota¬ 
tion gates we had found specific sequences for. Following 
the above stable relative fractions of approximately 40% 
T gates, 40% H gates, and 20% S gates in the approx¬ 
imating sequences found, we roughly assume that, on 
average, each arbitrary rotation in fact consists of 40 T 
gates, 40 H gates and 20 S gates. 

Taking an implementation accuracy e = 0.01 for each 
single-qubit rotation gate is not sufficient to guarantee 
accuracy e = 0.01 for the entire algorithm. To achieve 
the latter, we would typically require a much smaller tar¬ 
get accuracy for the implementation of single-qubit rota¬ 
tion gates. If the entire algorithm consists of nn single¬ 
qubit rotations, requiring a target accuracy e' = ejupi 
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for each rotation would be an obvious choice. This is a 
fairly conservative error bound though, presuming that 
all rotations are performed in a sequence, with errors in 
different rotations adding up, never canceling each other 
out, and disregarding any parallelism in their implemen¬ 
tations. However, errors may cancel each other out dur¬ 
ing the mostly sequential implementation of the gates. 

The LRE analysis of the bare algorithm excluding or¬ 
acle resources revealed roughly ur ~ 10^^ single-qubit 
rotations (with non-trivial angles of rotation), most of 
which have to be performed sequentially, as implied by 
the distinct lack of parallelism in the design of QLSA. 
According to Fowler’s analysis, the number of standard 
gates needed on average to implement (decompose) a 
single-qubit rotation with accuracy e' = e/riR is approx¬ 
imately: I = cf. Eq. ( 051 ) . Inserting the 

values Ur Ri 10^^ and e = 0.01 yields I « 480, which 
is less than by a factor 5 larger than what we assumed 
for our LRE analysis. Hence, while our LRE results in 
Table in provide gate counts for what is necessary (not 
sufficient) to achieve an accuracy e = 0.01 for the entire 
algorithm, the more conservative error bound e' = c/ur 
for the target rotation accuracy (to guarantee the accu¬ 
racy e for the whole algorithm) would yield estimates for 
H, S, and T gates as well as T-depth that are only by a 
factor ^ 5 larger. The overall gate count and overall cir¬ 
cuit depth would also be increased by a slightly smaller 
factor close to 5. 


2. Circuits and resource estimates of lower-level 
subroutines and multi-qubit gates 
employed by QLSA 


Here we review some well-known circuit decomposi¬ 
tions of various multi-qubit gates in terms of the stan¬ 
dard set of elementary gates {X, Y, Z, H, S,T, CNOT} 
and their associated resource counts that have been used 
for our QLSA LRE analysis. 


bO 



bO 




FIG. 30. Implementation of controlled-H gate in terms of 
CNOTs and single-qubit rotations. 


plementation can be further decomposed into sequences 
consisting only of T, S and H gates: Rziir) = T'^ = 
= Z, Rzi-ir) = = Z, i?;;(7r/4) = SHTSHXZS 

and Ry(-7r/4) = S'tZAiLS'tTtHS't. 


c. Controlled rotations 


Controlled single-qubit rotations Rz(0) can be imple¬ 
mented in terms of CNOTs and unconditional single¬ 
qubit rotations according to circuit equality provided 
in Fig. inn In the case of controlled single-qubit ro- 



FIG. 31. Implementation of controlled single-qubit rotation 
Rz{8) in terms of unconditional single-qubit rotations and 
GNOTs. 

tations Ry{8) we can use the circuit identity shown in 
Fig. 1321 A similar implementation can be derived for con¬ 
trolled single-qubit rotations Rx{0). Moreover, doubly- 


— Ry{ 8 ) — —0 ) 


FIG. 32. Implementation of controlled single-qubit rotation 
Ry{0) in terms of controlled single-qubit rotation Rz{8) and 
standard single-qubit gates. 


a. Controlled-Z gate 

Controlled-Z gate can be decomposed into two H gates 
and one CNOT according to Fig[^ 




FIG. 29. Controlled-gate in terms of standard gates. 


controlled rotations can be implemented in terms of Tof- 
folis, CNOTs and unconditional single-qubit rotations ac¬ 
cording to circuit equality given in Fig. 1551 



FIG. 33. Implementation of doubly-controlled single-qubit 
rotation Rz{8) in terms of Toffolis, CNOTs and unconditional 
single-qubit rotations. 


b. Controlled-H gate 

Controlled-7L gate can be implemented in terms of 
standard gates by using the circuit equality given in 
Fig. 1301 The single-qubit rotations employed in this im- 


d. Quantum Fourier Transform (QFT) 

Both Quantum Fourier Transform (QFT) and its in¬ 
verse QFT“^ are employed in the implementation of 
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QLSA. QFT and its representation in terms of a quan¬ 
tum circuit are discussed in most introductory textbooks 
on quantum computation, see e.g. [l|. A circuit imple¬ 
mentation of QFT“^ is shown in Fig. [Ml 



Rm — [h] 



FIG. 34. Quantum circuit to implement QFT ^ acting on b 
qubits, where Rk := (o exp(2°i/2'=))' 


Here we expand on elementary resource requirements 
of QFT (and its inverse QFT“^). Let 6 > 2 be the 
number of qubits the QFT (or its inverse) acts on, as 
in Fig. 1341 Using the circuit decomposition rule for con¬ 
trolled rotations discussed in appendix 1 2 cl we can derive 
the circuit identity shown in Fig. 1351 Using this circuit 
identity rule, we can express the logical resource require¬ 
ments in terms of standard gates and unconditional Rk 
gates, see Table Hill The latter can then be implemented 
in terms of approximating sequences consisting only of 
fault-tolerant gates from the set {X, Y, Z, H, S,T}, as dis¬ 
cussed in Appendix [TJ 


e. Toffoli gate 


Toffoli gate (essentially a CCNOT) can be imple¬ 
mented (cf., e.g., i) by a circuit using 6 CNOT gates, 
1 S gate, 7 T (or T^) gates and 2 Hadamard gates, and 
having circuit depth 12, see Fig. [36l 


191) 

192) 

193) 




Tt 


& 


FIG. 36. Decomposition of Toffoli gate in terms of standard 
set of gates. 


/. Multi-controlled NOT 

A multi-fold CNOT that is controlled by n > 3 qubits 
can be implemented by 2(n — 2) + I Toffoli gates, which 
must be performed sequentially, and employing (n — 2) 
additional ancilla qubits [s^ . Using the resources needed 
for Toffoli gates, we can infer the resource count of any 
multi-controlled NOT employing an arbitrary number of 
control qubits and a single target qubit, see Table ITVl 



FIG. 35. Implementing controlled-i?*, gates from circuit in 
Fig. [34]in terms of CNOT’s and unconditional Rk gates. 


Elementary resource 

Resource count 

H gates 

b 

unconditional Rk (or ) 

where k — 3, ..., 6-1-1 
and Rk := ( 0 exp(2ni/2'^) ) 

3(6 — fc -I- 2) for particular k 
|6(6 — 1) in total 

CNOT gates 

6(6-1) 

cicuit width 

6 

cicuit depth 

62 - 1-2 c-depth(Rfc) 

T depth 

2 ELs T-depth(Rfc) 


TABLE III. Resource requirement of QFT (or its inverse 
transformation QFT“^) in terms of standard gates and un¬ 
conditional Rk gates. The number of qubits involved in the 
transformation is denoted by b. The unconditional Rk (or 
Rk^) gates can be approximated by sequences consisting only 
of fault-tolerant gates T, S and H. 


Elementary resource 

resource count 

ancilla qubits 

n — 2 

H gates 

2(2n - 3) 

S gates 

(2n - 3) 

T gates 

7(2n - 3) 

CNOT gates 

6(2n - 3) 

cicuit width 

n -I- 1 

cicuit depth 

12(2n - 3) 

T depth 

6(2n - 3) 

Measurements 

(n-2) 


TABLE IV. Resource requirement of multi-controlled NOT 
employing n control qubits and a single target qubit. 


g. Controlled Phase: C-Phase(c\ (f>o, f) 

The task of the controlled-phase C-Phase(c; ((iq,/), 
which is a lower-level algorithmic building block used 
in the implementations of the higher-level subroutines 
‘StatePrep_b’, ‘StatePrep_R’ and ‘HamiltonianSimula- 
tion’ (see Fig. [2]), is to apply a phase shift to a signed 
n-qubit input register c, whereby the applied phase is 
controlled by c itself; 

n—2 

|c) ^ |c) , with e = ^ 2Vo<5eW.i . (27) 

2=0 
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c[0] - 


c[l]- 

O 


U 

c[n-2]- 

O 

C/3 

c3 

c[n-l]- 

cu 

1 

u 




FIG. 37. Quantum circuit to implement the controlled phase subroutine C-Phase(c; 4>o, /). 


Note, that the first c-register qubit c[0] signifies the least 
significant bit corresponding to the minimum phase shift 
(/)o, whereas the qubit c[n — 2] determines the most sig¬ 
nificant bit. Moreover, the last c-register qubit c[n — 1] 
controls the sign of the applied phase. To implement in¬ 
verse operations, it is conditionally flipped by a classical 
integer flag / G {0,1}; for / = 1 the phase should be 
inverted. The quantum circuit is provided in Fig. [33 
When employed as part of the subroutine 
M =Hmag(x, y, m,0o), the controlled-phase 
C-Phase(c; (}io,/) is in addition to be controlled by 
a single-qubit t[}] that is part of the ni-qubit HS 
control register t, see Figs. [T^ and (381 For the LREs of 


Elementary resource 

resource count 

ancilla qubits 

1 

H gates 

80(n - 1) 

S gates 

o 

1 

T gates 

80(n - 1) 

X gates 

4 + 2/ 

CNOT gates 

2 n 

cicuit width 

n + 1 

cicuit depth 

202 (n - 1) + 6 

T depth 

80(n - 1) 

Measurements 

1 


tUl- 


c[0]- 

c[ll- 

c[n-2]- 

c[n-l]- 






















ancilla a 


FIG. 38. Quantum circuit to implement the controlled-phase 
subroutine C-Phase(c; </)o,/), that is in addition further con¬ 
trolled by a single-qubit control register t[j]. 

C-Phase(c; (/)o,/) and C-Phase that is further controlled 
by a single-qubit t[}], we utilized the circuit decomposi¬ 
tion rules discussed in the previous appendix sections. In 
particular, we used the rough (and rather conservative) 
assumption that, on average, every (unconditional) 
single-qubit rotation gate can be approximated by 
sequences of approx. 100 fault-tolerant gates with each 
sequence roughly consisting of 40 T gates, 40 H gates and 
20 S gates, see appendix FT] The LREs of unconditional 
C-Phase and conditional C-Phase are summarized in 
Tables |V] and IVII 


TABLE V. Resource estimates for the unconditional 
C-Phase(c; (/iQ,/) subroutine implemented by circuit given in 
Fig. 1371 where c is an n-qubit register with n G N, and 
/ G {0,1} a classical integer flag. 


Elementary resource 

resource count 

ancilla qubits 

1 

H gates 

164(n - 1) 

S gates 

82(n - 1) 

T gates 

174(n - 1) 

X gates 

4 + 2/ 

CNOT gates 

16(n-l) + 2 

cicuit width 

n + 2 

cicuit depth 

436(n - 1) + 6 

T depth 

174(n - 1) 

Measurements 

1 


TABLE VI. Resource estimates for the conditional 
C-Phase(c; (/iQ,/) subroutine implemented by circuit in 
Fig. 1381 where c is an n-qubit register with n G N, and 
/ G {0,1} a classical integer flag. 


h. Controlled-RotY: C-RotY{c,t-, (j)o, f) 

The task of the subroutine C-RotY(c, t; ((jq, /), which 
is used in the implementation of higher-level subroutines 
‘StatePrep_b’, ‘StatePrep_R’ and ‘Solve_a;’, is to apply a 


single-qubit rotation Ry{0) to a single-qnhit target reg¬ 
ister t, where the angle of rotation 6 is controlled by a 
signed n-qubit input register c; 

n—2 

|t) ^ with 9 = ^ • (28) 

2=0 




































































































34 


The first c-register qubit c[0] signifies the least significant 
bit corresponding to the minimum angle of rotation 
whereas the qubit c[n — 2] determines the most signifi¬ 
cant bit. The sign of the applied rotation is controlled by 
the last c-register qubit c[n — 1]. In addition, it is con¬ 
ditionally flipped by a classical integer flag / € {0,1} to 
enable straightforward inverse operations. The quantum 
circuit is provided in Fig. 1351 For the LRE of subroutine 
C-RotY(c,t; (pQ, /), we utilized the circuit decomposition 
rules discussed in the previous appendix sections; our es¬ 
timates are summarized in Table IVTIl 


l9i> 

| 92 > 



FIG. 40. Definition of the two-qubit ‘VF-gate’ and its imple¬ 
mentation in terms of CNOTs and single-qubit rotations. 


3. Resource estimates for the Oracles 



FIG. 39. Quantum circuit to implement the subroutine 
G-RotY(c, t; (^ 0 ,/), employing an n-qubit control register c 
and a single-qubit target register t. The classical integer flag 
/ € {0,1} facilitates inverse transformations. 


Elementary resource 

resource count 

ancilla qubits 

0 

H gates 

84(n - 1) 

S gates 

42(n - 1) 

T gates 

00 

O 

1 

X gates 

2 / 

GNOT gates 

2n 

cicuit width 

n-\- 1 

cicuit depth 

202 (n - 1)-b 2/ 

T depth 

80(n - 1) 

Measurements 

1 


TABLE VII. Resource estimates for G-RotY(c, t; t/io,/) sub¬ 
routine, whose quantum circuit is shown in Fig. 1391 where c 
is an n-qubit input register with n G N, and / G {0,1}. 


Below we report our LRE results for some representa¬ 
tive oracle queries; all other oracle queries have similar 
resource counts. These results depend on several choices: 
the internal representation for real and integer numbers, 
the details of the linear-system problem definition, and 
the method for generating oracles. As for the internal 
representation of numbers, since every single operation 
had to be built from scratch, we used fixed-point repre¬ 
sentation. Compared to a floating-point representation, 
it is simpler and therefore generates smaller circuits. Re¬ 
garding the details of the linear-system problem defini¬ 
tion, they constitute the core data of this particular im¬ 
plementation of QLSA; provided in the GFI, we made no 
effort to modify them. Finally, the oracles were generated 
with an automated tool, turning a classical description 
of an algorithm into a reversible quantum circuit. We 
made this choice because we felt that it was the most 
natural (and practical) solution for the particular kind 
of oracles we were dealing with: general functions over 
real and complex numbers. 

Quipper automatically generates recursive decomposi¬ 
tions of oracles down to the level of gates such as ini¬ 
tialization, termination, etc. and controlled-nots (by at 
most one or two wires, each on either true or false). The 
rules for decomposing these gates into the standard-basis 
gates H, S, T, and X, and calculating circuit-depths and 
T-depths are included manually. Our rules for the depths 
are very conservative: we assume sequential executions 
unless we know better strategies. Indeed, optimal-depth 
decompositions are known only for fairly small gates, 
such as e.g. the Toffoli gate. Hence we expect over¬ 
estimates both for circuit- and T-deothJ^. These recur¬ 
sive gate-decomposition rules are coded in the symbolic 
programming software Mathematica for computing the 
final estimates. 

Oracle A returns either the magnitude (argflag = 


i. W-gate 

‘VF-gate’ is a two-qubit gate whose action as well as its 
implementation in terms of standard gates is illustrated 
in Fig. |40l As described above for the ‘controlled-H’ 
gate, the single-qubit rotations Ryiir/^) 

and Ry{—Tr/A) can be further decomposed in terms of 
sequences consisting only of T, S and H gates. 


As discussed previously, our circuit-implementations of oracles 
are essentially the trace of execution of a classical program of an 
algorithm. Because the algorithms we used are purely sequential, 
the corresponding quantum circuits are not easily parallelizable 
on a global scale. The only possible optimizations are purely 
local. We therefore conclude that our computed circuit- and T- 
depth values are over-estimates by some unknown small factor 
wrt. optimal depth values. 
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False) or the phase (argflag = True) of the coupling 
weight and the connected node index at the chosen 
matrix-decomposition-band index (from 1 to = 9). 
As there are many combinations, we will show a repre¬ 
sentative sample and will draw conclusions from them. 
As is evident from Table IVTlIl that the estimates for dif¬ 
ferent bands in the argflag = False cases all agree to 
the sub-one-percent level, or to three significant figures, 
with the exception of the number of qubits which only 
agree to within about three percents of each other, or 
to two significant figures. Therefore anyone of them can 
be taken as a representative for all argflag = False or¬ 
acle A resource estimates and a representative table is 
also presented. Similar phenomenon is true for all the 
argflag = True cases and only a representative table is 
presented for them. As gate decompositions used are to 
the basis-gate level, the number of ancillas and measure¬ 
ments should agree in every case, each with individual 
band index and argflag. This is indeed true in all cases 
for which we have performed resource counting. The two 
representative tables for argflag = False and argflag 
= True are presented in Table llXl Finally, the resource 
counts for Oracle r and for Oracle b are done similarly: 
Quipper gives logical resource estimates, then recursive 
gate-decomposition rules are coded in the symbolic com¬ 
puting software Mathematica for computing the final es¬ 
timates presented in Table Kl 

One may wonder why our oracle implementations re¬ 
quire such a huge number of auxiliary qubits and mea¬ 
surements - namely, up to ^ 10® ancilla qubits and mea¬ 
surements for a problem size N ps 3 x 10®. This in¬ 
deed is a feature of our low-level implementation of the 
irreversible-to-reversible transformations that is similar 


to the way “logical reversibility of computation” was pro¬ 
posed by Bennett in [s^. In essence, to ensure that the 
run of the entire computation can be unwound, the re¬ 
sult of each of its elementary sub-computations is stored 
in an auxiliary qubit. When the final result has been 
computed, it is copied into a fresh quantum register, 
and the entire computation is reversed, with every sub¬ 
computation undone along the way, and the initial values 
‘0’ of the intermediate auxiliary qubits restored and ver¬ 
ified by a measurement. The number of auxiliary qubits 
required is therefore directly proportional to the num¬ 
ber of elementary computational steps, and thus to the 
number of gates in the oracle. And the number of mea¬ 
surements needed to ensure reversibility of computation 
equals the number of ancilla qubits. One might argue 
that such an implementation is unnecessarily verbose. 
While we agree that there may be more efficient imple¬ 
mentations (e.g., by using some known efficient adders 
when performing addition), we note that for arbitrary 
computations there is no known ‘efficient’ way of imple¬ 
menting such an oracle. Indeed, our oracle implemen¬ 
tations yield a first baseline count, and they also show 
that more research needs to be done on the generation 
of reversible quantum circuits. On the other hand, our 
proposed implementation is arguably not so inefficient, in 
the sense that the size of the circuit (and therefore also 
the number of auxiliary qubits) is directly proportional 
(and not, say, exponential) to the length of the classical 
computation that would compute the data. In particu¬ 
lar, the size of the circuit for the oracle computing an 
element of the matrix A is linear in the number of bits 
required to store the size of the matrix. 
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